Spark - blogs.ischoolblogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/11/spark_twitter.…

Page 1

Spark
Making Big Data Analytics Interactive and Real-Time

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

spark-project.org

UC BERKELEY

Page 2

Overview
Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)

This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research

Page 3

Motivation
MapReduce simplified data analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing

One reaction: specialized models for some of these apps (e.g. Pregel, Storm)

Page 4

Motivation
Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:

Efficient primitives for data sharing

Page 5

Examples

[Figure: two data-sharing patterns in MapReduce. Iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . . Interactive queries: query 1, query 2, query 3, . . . each do an HDFS read of Input and produce result 1, result 2, result 3, . . .]

Slow due to replication and disk I/O, but necessary for fault tolerance

Page 6

Goal: Sharing at Memory Speed

[Figure: the same two patterns without HDFS between steps. Iterative job: Input → iter. 1 → iter. 2 → . . . Interactive queries: Input → one-time processing → query 1, query 2, query 3, . . .]

10-100× faster than network/disk, but how to get FT (fault tolerance)?

Page 7

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 8

Existing Systems
Existing in-memory storage systems have interfaces based on fine-grained updates
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory

Requires replicating data or logs across nodes for fault tolerance → expensive!
» 10-100x slower than memory write…

Page 9

Solution: Resilient Distributed Datasets (RDDs)

Provide an interface based on coarse-grained operations (map, group-by, join, …)

Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

Page 10

RDD Recovery

[Figure: the memory-sharing diagrams again. Interactive queries: Input → one-time processing → query 1, query 2, query 3, . . . Iterative job: Input → iter. 1 → iter. 2 → . . .]

Page 11

Generality of RDDs
RDDs can express surprisingly many parallel algorithms
» These naturally apply same operation to many items

Capture many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …

Allow these models to be composed

Page 12

Outline
» Programming interface
» Examples
» User applications
» Implementation
» Demo
» Current research: Spark Streaming

Page 13

Spark Programming Interface

Language-integrated API in Scala*

Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)

*Also Java, SQL and soon Python

Page 14

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.persist()

cachedMsgs.filter(_.contains("foo")).count          // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the Driver sends tasks to three Workers and receives results; each Worker reads one block (Block 1, Block 2, Block 3) and caches its messages (Msgs. 1, Msgs. 2, Msgs. 3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 15

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
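The lineage idea on this slide can be illustrated with a small sketch in plain Python. This is not Spark's implementation: the `Dataset`, `Source`, and `Derived` classes and the sample log records are hypothetical names invented for the illustration. Each derived dataset stores only its parent and the coarse-grained operation that produces it, so a lost partition is rebuilt by re-running that operation on the parent partition.

```python
class Dataset:
    """Base class: transformations return new datasets that remember their lineage."""

    def filter(self, pred):
        return Derived(self, lambda rows: [r for r in rows if pred(r)])

    def map(self, fn):
        return Derived(self, lambda rows: [fn(r) for r in rows])


class Source(Dataset):
    """Stands in for reliably stored input, split into partitions."""

    def __init__(self, partitions):
        self.partitions = partitions

    def compute(self, index):
        return list(self.partitions[index])


class Derived(Dataset):
    """A dataset defined by one operation applied to every partition of a parent."""

    def __init__(self, parent, op):
        self.parent = parent      # lineage pointer
        self.op = op              # function: list of records -> list of records
        self.cache = {}           # partition index -> computed records

    def compute(self, index):
        if index not in self.cache:   # never computed, or lost: rebuild from lineage
            self.cache[index] = self.op(self.parent.compute(index))
        return self.cache[index]


log = Source([
    ["ERROR\tdisk\tfull", "INFO\tok\tfine"],
    ["ERROR\tnet\ttimeout"],
])
messages = log.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

first = [messages.compute(i) for i in range(2)]
messages.cache.pop(1)             # simulate losing partition 1 of the cached data
recovered = messages.compute(1)   # recomputed through the filter -> map lineage
```

Only the lost partition is recomputed, and nothing extra is written while the job runs normally, which matches the "no cost if nothing fails" point made earlier in the talk.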

Page 16

Fault Recovery Results

[Figure: bar chart of iteration time (s) for iterations 1 to 10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The 81 s bar (iteration 6) is annotated "Failure happens".]

Page 17

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter plot of + and – points with a random initial line and the target separating line]

Page 18

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
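For readers who want to run the loop without a cluster, the following is a local sketch in plain Python that uses the same per-point gradient formula as the slide. The four sample points, the `dot`, `gradient`, and `train` helpers, and the fixed random seed are assumptions made for the illustration; in Spark the inner loop would be the `map` and `reduce` over the persisted dataset.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gradient(w, point):
    """Per-point term from the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, dims, iterations, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(dims)]    # random initial line
    for _ in range(iterations):
        total = [0.0] * dims
        for p in points:                                  # "map" then "reduce(_ + _)"
            total = [t + g for t, g in zip(total, gradient(w, p))]
        w = [wi - ti for wi, ti in zip(w, total)]         # w -= gradient
    return w


# Labels are +1 or -1; the two classes are linearly separable.
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w = train(points, dims=2, iterations=50)
correct = all(y * dot(w, x) > 0 for x, y in points)
```

Every iteration reads the same `points` again, which is why keeping the dataset in memory pays off after the first pass.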

Page 19

Logistic Regression Performance

[Figure: running time (min) vs. number of iterations (1, 10, 20, 30) for Hadoop and Spark]

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 6 s

Page 20

Example: Collaborative Filtering

Goal: predict users’ movie ratings based on past ratings of other movies

[Figure: ratings matrix R, with one axis labeled Movies and the other Users; known entries are ratings from 1 to 5 and unknown entries are shown as "?"]

Page 21

Model and Algorithm
Model R as product of user and movie feature matrices A and B of size U×K and M×K

R = A B^T

Alternating Least Squares (ALS)
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged

Page 22

Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

Note: (0 until U) and (0 until M) are Range objects
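The slides do not show `updateUser` or `updateMovie`. As a hedged illustration of what such updates can look like, here is a plain-Python sketch for the simplest case K = 1, where each user and movie has a single feature and the least-squares update has a closed form. The tiny ratings matrix, the `update_user`, `update_movie`, and `squared_error` helpers, and the all-ones starting point (used instead of random vectors so the run is repeatable) are assumptions for this example only.

```python
# None marks a missing rating.
R = [
    [5.0, 4.0, None],
    [4.0, None, 2.0],
    [None, 2.0, 1.0],
]
U, M = len(R), len(R[0])


def update_user(u, B, R):
    """Scalar a_u minimizing the sum over rated movies m of (R[u][m] - a_u * B[m])^2."""
    rated = [m for m in range(M) if R[u][m] is not None]
    return sum(R[u][m] * B[m] for m in rated) / sum(B[m] ** 2 for m in rated)


def update_movie(m, A, R):
    """Scalar b_m minimizing the sum over rating users u of (R[u][m] - A[u] * b_m)^2."""
    raters = [u for u in range(U) if R[u][m] is not None]
    return sum(R[u][m] * A[u] for u in raters) / sum(A[u] ** 2 for u in raters)


def squared_error(A, B):
    return sum((R[u][m] - A[u] * B[m]) ** 2
               for u in range(U) for m in range(M) if R[u][m] is not None)


A = [1.0] * U
B = [1.0] * M
errors = [squared_error(A, B)]
for _ in range(10):
    A = [update_user(u, B, R) for u in range(U)]     # optimize users with movies fixed
    B = [update_movie(m, A, R) for m in range(M)]    # optimize movies with users fixed
    errors.append(squared_error(A, B))
```

Because each half-step solves its least-squares problem exactly, the recorded error never increases from one iteration to the next. Each user update reads only B and R, and each movie update reads only A and R, which is what makes the per-index `map` easy to parallelize.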

Page 23

Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R re-sent to all nodes in each iteration

Page 24

Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as broadcast variable

Result: 3× performance improvement
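A back-of-the-envelope cost model shows why the broadcast version sends less data. The sketch below is plain Python and is only a model, not a measurement: the synthetic `ratings` matrix, the node count, and the iteration count are made-up values, and `pickle` merely stands in for whatever serialization the cluster uses. It assumes, as the previous slide states, that the naïve version re-sends R to every node in each iteration.

```python
import pickle

# Stand-in for the ratings matrix R (500 users x 200 movies of small integers).
ratings = [[(u * 7 + m) % 5 for m in range(200)] for u in range(500)]
payload = len(pickle.dumps(ratings))     # bytes needed to ship R once

nodes, iterations = 10, 20

# Naive version: R travels with the task closures, so every node
# receives it again in every iteration.
naive_bytes = payload * nodes * iterations

# Broadcast version: R is shipped to each node once and then reused.
broadcast_bytes = payload * nodes
```

Under these assumptions the naïve version moves `iterations` times as much data for R. The end-to-end gain reported on the slide is smaller than that ratio because iteration time also includes computation and the transfer of A and B.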

Page 25

Scaling Up Broadcast

[Figure: two stacked-bar panels, "Initial version (HDFS)" and "Cornet P2P broadcast". Each plots iteration time (s), on a 0 to 250 scale, against number of machines (10, 30, 60, 90), split into Communication and Computation.]

[Chowdhury et al, SIGCOMM 2011]

Page 26

Other RDD Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, …

Actions (return a result to driver program): collect, reduce, count, save, …

Page 27

Spark in Java

// Scala
lines.filter(_.contains("error")).count()

// Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Page 28

Spark in Python (Coming Soon!)

import sys

# sc: an existing SparkContext (not defined on the slide)
lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
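To make the semantics of the three operators concrete, here is the same word count on local Python lists, with no Spark involved. The `flat_map` and `reduce_by_key` helpers and the two sample lines are hypothetical stand-ins written for this sketch; they mimic what the distributed operators compute, not how Spark executes them.

```python
from itertools import chain


def flat_map(fn, records):
    """Apply fn to each record and concatenate the resulting sequences."""
    return list(chain.from_iterable(fn(r) for r in records))


def reduce_by_key(fn, pairs):
    """Merge all values that share a key using the two-argument function fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


lines = ["to be or", "not to be"]
words = flat_map(lambda x: x.split(' '), lines)
pairs = [(w, 1) for w in words]                      # the .map(lambda x: (x, 1)) step
counts = reduce_by_key(lambda x, y: x + y, pairs)
```

Because the merge function is applied pairwise per key, it should be associative; that property is what lets a distributed implementation combine partial results from different partitions in any grouping.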

Page 30

Spark Users

400+ user meetup, 20+ contributors

Page 31

User Applications
» Crowdsourced traffic estimation (Mobile Millennium)
» Video analytics & anomaly detection (Conviva)
» Ad-hoc queries from web app (Quantifind)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
» …

Page 32

Mobile Millennium Project
Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)

Page 33

Sample Data

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Page 34

Challenge
Data is noisy and sparse (1 sample/minute)

Must infer path taken by each vehicle in addition to travel time distribution on each link

Page 36

Solution
EM algorithm to estimate paths and travel time distributions simultaneously

[Figure: iterative dataflow: observations → flatMap → weighted path samples → groupByKey → link parameters → broadcast back to the flatMap step]

Page 37

Results

3× speedup from caching, 4.5x from broadcast

[Hunter et al, SOCC 2011]

Page 38

Conviva GeoReport

SQL aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated I/O, deserialization and filtering

[Figure: bar chart of time (hours): Spark 0.5, Hive 20]

Page 39

Other Programming Models

Pregel on Spark (Bagel)
» 200 lines of code

Iterative MapReduce
» 200 lines of code

Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala

Page 41

Implementation
Runs on Apache Mesos cluster manager to coexist w/ Hadoop

Supports any Hadoop storage system (HDFS, HBase, …)

Easy local mode and EC2 launch scripts

No changes to Scala

[Figure: Spark, Hadoop, and MPI running side by side on Mesos, which manages the cluster nodes (Node, Node, Node, …)]

Page 42

Task Scheduler
» Runs general DAGs
» Pipelines functions within a stage
» Cache-aware data reuse & locality
» Partitioning-aware to avoid shuffles

[Figure: DAG of RDDs A to G connected by groupBy, map, union, and join operations and divided into Stage 1, Stage 2, and Stage 3; shaded boxes mark cached data partitions]
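"Pipelines functions within a stage" can be pictured with Python generators. The sketch below is an illustration only, not the scheduler's code: the `parse`, `is_even`, and `pipelined` names and the `trace` list are hypothetical and exist just to expose the order of evaluation. Each record passes through the map step and the filter step before the next record is read, so no intermediate collection is materialized between the two operations.

```python
trace = []   # records the order in which the two functions run


def parse(record):
    trace.append(("parse", record))
    return int(record)


def is_even(n):
    trace.append(("filter", n))
    return n % 2 == 0


def pipelined(partition):
    """Fuse a map and a filter over one partition using lazy generators."""
    parsed = (parse(r) for r in partition)        # lazy: nothing runs yet
    return (n for n in parsed if is_even(n))      # still lazy


result = list(pipelined(["1", "2", "3"]))         # consuming the output drives the pipeline
```

Operations that need data from many partitions at once, such as a group-by, cannot be fused this way, which is why they mark the boundary between stages in the figure.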

Page 43

Language Integration
Scala closures are Serializable Java objects
» Serialize on master, load & run on workers

Not quite enough
» Nested closures may reference entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection

Page 44

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
» Track variables that each line depends on
» Ship generated classes to workers

Enables in-memory exploration of big data

Page 47

Motivation
Many “big data” apps need to work in real time
» Site statistics, spam filtering, intrusion detection, …

To scale to 100s of nodes, need:
» Fault-tolerance: for both crashes and stragglers
» Efficiency: don’t consume many resources beyond base processing

Challenging in existing streaming systems

Page 48

Traditional Streaming Systems

Continuous processing model
» Each node has long-lived state
» For each record, update state & send new records

[Figure: node 1 and node 2 receive input records, each holds mutable state, and both push records to node 3]

Page 49

Traditional Streaming Systems
Fault tolerance via replication or upstream backup:

[Figure: two diagrams. Replication: nodes 1, 2, 3 process the input while replicas 1’, 2’, 3’ process a copy of the input, with synchronization between the two sets. Upstream backup: nodes 1, 2, 3 process the input with a single standby node.]

Page 50

Traditional Streaming Systems
Fault tolerance via replication or upstream backup:

[Figure: the replication and upstream backup diagrams from the previous slide]

» Replication: fast recovery, but 2x hardware cost
» Upstream backup: only need 1 standby, but slow to recover

Page 51

Traditional Streaming Systems
Fault tolerance via replication or upstream backup:

[Figure: the replication and upstream backup diagrams from the previous slides]

» Replication: [Borealis, Flux]
» Upstream backup: [Hwang et al, 2005]

Neither approach can handle stragglers

Page 52

Observation
Batch processing models, such as MapReduce, do provide fault tolerance efficiently
» Divide job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes

Idea: run streaming computations as a series of small, deterministic batch jobs
» Same recovery schemes at much smaller timescale
» To make latency low, store state in RDDs

Page 53

Discretized Stream Processing

[Figure: at t = 1, t = 2, … the system pulls input into an immutable dataset (stored reliably), then runs a batch operation that produces another immutable dataset (output or state), stored as an RDD]

Page 54

Fault Recovery
Checkpoint state RDDs periodically

If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes

[Figure: a map operation from the partitions of an input dataset to the partitions of an output dataset]

Faster recovery than upstream backup, without the cost of replication

Page 55

How Fast Can It Go?
Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency

[Figure: max throughput under a given latency (1 or 2 s). Two panels, Grep and Top K Words, each plotting records/s (millions) against nodes in cluster (0 to 100), with one line for 1 sec and one for 2 sec.]

Page 56

Comparison with Storm

[Figure: two panels, Grep and Top K Words, each plotting MB/s per node against record size (100 and 1000 bytes) for Spark and Storm]

Storm limited to 100K records/s/node

Also tried S4: 10K records/s/node

Commercial systems: O(500K) total

Lack Spark’s FT guarantees

Page 57

How Fast Can It Recover?
Recovers from faults/stragglers within 1 second

[Figure: interval processing time (s), on a 0 to 1.5 scale, against time (s) from 0 to 25, for Sliding WordCount on 20 nodes with 10s checkpoint interval; one point is annotated "Failure happens"]

Page 58

Programming Interface
Extension to Spark: Spark Streaming
» All Spark operators plus new “stateful” ones

// Running count of pageviews by URL
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Figure: at t = 1, t = 2, . . . each of views, ones, and counts is an RDD made of partitions; map derives ones from views and reduce derives counts from ones. Each such sequence of RDDs is a “discretized stream” (D-stream).]
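The running count above can be mimicked locally to show the discretized model: each interval is handled by one deterministic batch function that takes the previous state and the interval's events and returns a new state, leaving the old one untouched. The sketch is plain Python; `run_batch`, the sample URLs, and the three intervals are assumptions made for the illustration and are not part of the Spark Streaming API.

```python
def run_batch(state, events):
    """One deterministic batch job: previous counts + this interval's URLs -> new counts."""
    new_state = dict(state)                  # never mutate the previous dataset
    for url in events:
        new_state[url] = new_state.get(url, 0) + 1
    return new_state


intervals = [
    ["/home", "/about", "/home"],   # t = 1
    ["/home"],                      # t = 2
    [],                             # t = 3
]

states = [{}]                       # state before any input arrives
for events in intervals:
    states.append(run_batch(states[-1], events))

# If the state for t = 2 were lost, it could be rebuilt from the state
# for t = 1 and the (reliably stored) input of interval 2.
rebuilt = run_batch(states[1], intervals[1])
```

Because every state is an immutable value produced by a deterministic function, recovery is just re-running the function, which is the property the talk borrows from batch systems.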

Page 59

Incremental Operators

words.reduceByWindow("5s", max)
Associative function
[Figure: words → interval max → sliding max, over intervals t-1, t, t+1, t+2, t+3, t+4; each sliding max applies max to the interval results inside the window]

words.reduceByWindow("5s", _+_, _-_)
Associative & invertible
[Figure: words → interval counts → sliding counts, over intervals t-1, t, t+1, t+2, t+3, t+4; each sliding count is the previous sliding count + the newest interval count – the oldest interval count]
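The difference between the two forms can be sketched in plain Python. With only an associative function, each window is re-reduced from the per-interval results; with an inverse function as well, each window is derived from the previous one by adding the newest interval and removing the oldest. The `sliding_recompute` and `sliding_incremental` helpers and the sample per-interval values are hypothetical, and the window is expressed as a number of intervals rather than the "5s" duration used on the slide.

```python
def sliding_recompute(interval_values, window, reduce_fn):
    """Associative only: reduce all interval results inside each window again."""
    out = []
    for end in range(window, len(interval_values) + 1):
        acc = interval_values[end - window]
        for v in interval_values[end - window + 1:end]:
            acc = reduce_fn(acc, v)
        out.append(acc)
    return out


def sliding_incremental(interval_values, window, reduce_fn, inverse_fn):
    """Associative and invertible: add the newest interval, remove the oldest."""
    out = []
    acc = None
    for i, v in enumerate(interval_values):
        acc = v if acc is None else reduce_fn(acc, v)
        if i >= window:
            acc = inverse_fn(acc, interval_values[i - window])
        if i >= window - 1:
            out.append(acc)
    return out


per_interval = [3, 1, 4, 1, 5, 9, 2]     # e.g. one count (or max) per interval
sliding_max = sliding_recompute(per_interval, 5, max)
sliding_sum = sliding_incremental(per_interval, 5,
                                  lambda a, b: a + b,
                                  lambda a, b: a - b)
```

The incremental form does a constant amount of work per new interval regardless of window length, but it only applies when an inverse exists; `max` has none, so it has to use the recomputing form.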

Page 60

Applications

Conviva video dashboard (>50 session-level metrics)
[Figure: active video sessions (millions), on a 0 to 4.0 scale, vs. nodes in cluster (0 to 64)]

Mobile Millennium traffic estimation (online EM algorithm)
[Figure: GPS observations / sec, on a 0 to 2000 scale, vs. nodes in cluster (0 to 80)]

Page 61

Unifying Streaming and Batch

D-streams and RDDs can seamlessly be combined
» Same execution and fault recovery models

Enables powerful features:
» Combining streams with historical data (used in MM application):
  pageViews.join(historicCounts).map(...)
» Interactive ad-hoc queries on stream state (used in Conviva app):
  pageViews.slice("21:00","21:05").topK(10)

Page 62

Benefits of a Unified Stack
» Write each algorithm only once
» Reuse data across streaming & batch jobs
» Query stream state instead of waiting for import

Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard

Page 63

Conclusion
“Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing

RDDs offer fault-tolerant sharing at memory speed

Spark uses them to combine streaming, batch & interactive analytics in one system

www.spark-project.org

Page 64

Related Work

DryadLINQ, FlumeJava
» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries

GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints

Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern

Relational databases
» Lineage/provenance, logical logging, materialized views

Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached

Page 65

Behavior with Not Enough RAM

[Figure: bar chart of iteration time (s) by % of working set in memory]
» Cache disabled: 68.8
» 25%: 58.1
» 50%: 40.7
» 75%: 29.7
» Fully cached: 11.5

Page 66

RDDs for Debugging
Debugging general distributed apps is very hard

However, Spark and other recent frameworks run deterministic tasks for fault tolerance

Leverage this determinism for debugging:
» Log lineage for all RDDs created (small)
» Let user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions