Transcript of "Big Data Processing", Stanford Spark workshop slides (stanford.edu/~rezab/sparkworkshop/slides/patrick.pdf)

Page 1:

Patrick Wendell Databricks

Big Data Processing

Page 2:

About me

Committer and PMC member of Apache Spark

“Former” PhD student at Berkeley

Left Berkeley to help found Databricks

Now managing open source work at Databricks

Focus is on networking and operating systems

Page 3:

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

Page 4:

Straw poll. Have you:

-  Written code in a numerical programming environment (R/Matlab/Weka/etc)?

-  Written code in a general programming language (Python/Java/etc)?

-  Written multi-threaded or distributed computer programs?

Page 5:

Today’s workshop

Overview of trends in large-scale data analysis

Introduction to the Spark cluster-computing engine with a focus on numerical computing

Hands-on workshop with TAs covering Scala basics and using Spark for machine learning

Page 6:

Hands-on Exercises

You’ll be given a cluster of 5 machines*

(a) Text mining of full text of Wikipedia

(b) Movie recommendations with ALS

5 machines * 200 people = 1,000 machines

Page 7:

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

Page 8:

The Big Data Problem

Data is growing faster than computation speeds

Growing data sources » Web, mobile, scientific, …

Cheap storage » Doubling every 18 months

Stalling CPU speeds

Storage bottlenecks

Page 9:

Examples

Facebook’s daily logs: 60 TB

1000 genomes project: 200 TB

Google web index: 10+ PB

Cost of 1 TB of disk: $50

Time to read 1 TB from disk: 6 hours (50 MB/s)
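The last two figures imply a quick sanity check; in plain Python (decimal units assumed, not stated on the slide):

```python
# Back-of-the-envelope check of the slide's last two numbers: scanning 1 TB
# sequentially at ~50 MB/s.
tb_in_mb = 1_000_000      # 1 TB expressed in MB (decimal units)
rate_mb_per_s = 50

seconds = tb_in_mb / rate_mb_per_s  # 20,000 s
hours = seconds / 3600              # ~5.6 h, i.e. the "6 hours" on the slide
print(f"{hours:.1f} hours")
```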

Page 10:

The Big Data Problem

A single machine can no longer process or even store all the data!

Only solution is to distribute over large clusters

Page 11:

Google Datacenter

How do we program this thing?

Page 12:

What’s hard about cluster computing?

How to divide work across nodes?
•  Must consider network, data locality
•  Moving data may be very expensive

How to deal with failures?

•  1 server fails every 3 years => 10K nodes see 10 faults/day

•  Even worse: stragglers (a node has not failed, but is slow)
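The arithmetic behind the first failure bullet can be checked directly; a plain Python sketch using only the slide's own numbers:

```python
# The slide's failure arithmetic: one server failing on average every 3 years
# means a 10,000-node cluster sees on the order of ten failures every day.
nodes = 10_000
mtbf_days = 3 * 365  # mean time between failures for a single node, in days

faults_per_day = nodes / mtbf_days  # ~9.1, i.e. roughly "10 faults/day"
print(round(faults_per_day, 1))
```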

Page 13:

A comparison:

How to divide work across nodes? What if certain elements of a matrix/array became 1000 times slower to access than others?

How to deal with failures? What if, with 10% probability, certain variables in your program became undefined?

Page 14:

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

Page 15:

The Spark Computing Framework

Provides a programming abstraction and parallel runtime to hide this complexity.

“Here’s an operation, run it on all of the data” » I don’t care where it runs (you schedule that) » In fact, feel free to run it twice on different nodes

Page 16:

Resilient Distributed Datasets

Key programming abstraction in Spark

Think “parallel data frame”

Supports collection/set APIs:

# get variance of 5th field of a tab-delimited dataset
rdd = spark.hadoopFile("big-file.txt")
rdd.map(line => line.split("\t")(4))
   .map(cell => cell.toDouble())
   .variance()
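The Scala pipeline above can be mimicked on an ordinary local list standing in for the distributed RDD; a plain Python sketch, with sample lines made up for illustration:

```python
from statistics import pvariance

# Stand-in for the RDD loaded from "big-file.txt": tab-delimited lines.
lines = [
    "a\tb\tc\td\t1.0",
    "e\tf\tg\th\t3.0",
    "i\tj\tk\tl\t5.0",
]

# Same shape as the Scala pipeline: split, take the 5th field, parse, variance.
values = [float(line.split("\t")[4]) for line in lines]
variance = pvariance(values)  # Spark's variance() is the population variance
print(variance)
```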

Page 17:

RDD Execution

Automatically split work into many small, idempotent tasks

Send tasks to nodes based on data locality

Load-balance dynamically as tasks finish
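The load-balancing idea can be sketched as a greedy simulation: each task goes to whichever worker frees up first, so fast workers run more tasks. This is an illustration only; it ignores data locality, which Spark's scheduler also weighs:

```python
import heapq

# Toy sketch of dynamic load balancing: assign each task to the worker that
# becomes free soonest. Not Spark's scheduler; locality is ignored here.
def schedule(task_costs, n_workers):
    workers = [(0, w) for w in range(n_workers)]  # (time_when_free, worker_id)
    heapq.heapify(workers)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in enumerate(task_costs):
        free_at, w = heapq.heappop(workers)
        assignment[w].append(task)
        heapq.heappush(workers, (free_at + cost, w))
    return assignment

# 6 tasks, one heavy: the worker that draws the heavy task runs fewer tasks.
print(schedule([1, 1, 10, 1, 1, 1], 2))
```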

Page 18:

Fault Recovery

1. If a task crashes:
» Retry on another node
•  OK for a map because it had no dependencies
•  OK for a reduce because map outputs are on disk

Requires user code to be deterministic
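A minimal sketch of the retry idea in plain Python, not Spark's actual scheduler; the flaky_square_sum task and its failure mode are invented for illustration:

```python
# Toy sketch of "if a task crashes, retry on another node". Re-running is safe
# only because the task is deterministic and side-effect free.
def run_with_retry(task, inputs, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return task(inputs)
        except RuntimeError:
            continue  # pretend to reschedule the task on another node
    raise RuntimeError("task failed on every attempt")

attempts = 0

def flaky_square_sum(xs):
    global attempts
    attempts += 1
    if attempts == 1:  # the first "node" crashes mid-task
        raise RuntimeError("node lost")
    return sum(x * x for x in xs)

result = run_with_retry(flaky_square_sum, [1, 2, 3])
print(result)  # 14, despite one simulated crash
```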

Page 19:

Fault Recovery

2. If a node crashes: » Relaunch its current tasks on other nodes » Relaunch any maps the node previously ran

•  Necessary because their output files were lost

Page 20:

Fault Recovery

3. If a task is going slowly (straggler):
» Launch a second copy of the task on another node
» Take the output of whichever copy finishes first, and kill the other one
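This "speculative execution" idea can be sketched with plain Python threads; the task function and delays are made up, and unlike Spark we cannot actually kill the losing copy, so we simply ignore its result:

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Toy sketch of straggler mitigation: launch two copies of a deterministic
# task and take whichever finishes first.
def task(delay, xs):
    time.sleep(delay)
    return sum(xs)

with ThreadPoolExecutor(max_workers=2) as pool:
    fast = pool.submit(task, 0.01, [1, 2, 3])  # healthy node
    slow = pool.submit(task, 0.50, [1, 2, 3])  # straggler: same work, slower
    done, _ = wait([fast, slow], return_when=FIRST_COMPLETED)
    result = done.pop().result()

print(result)  # correct either way, because both copies compute the same thing
```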

Page 21:

Spark Compared with Earlier Approaches

Higher-level, declarative API.

Built from the “ground up” for performance – including support for leveraging distributed cluster memory.

Optimized for iterative computations (e.g. machine learning)

Page 22:

Spark Platform

Spark RDD API

Page 23:

Spark Platform: GraphX

[Diagram: GraphX Graph (alpha), built on RDD-based graphs, layered on the Spark RDD API]

graph = Graph(vertexRDD, edgeRDD)
graph.connectedComponents() # returns a new RDD
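The semantics of connectedComponents() can be illustrated with a small label-propagation sketch in plain Python: each vertex ends up labeled with the smallest vertex id in its component (toy graph, not the GraphX API):

```python
# Label propagation to a fixed point: push the minimum label across each edge
# until nothing changes. Each component ends up labeled by its smallest id.
def connected_components(vertices, edges):
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            for v in (a, b):
                if labels[v] > low:
                    labels[v] = low
                    changed = True
    return labels

print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```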

Page 24:

Spark Platform: MLLib

[Diagram: MLLib (machine learning, on RDD-based matrices) and GraphX Graph (alpha, on RDD-based graphs) layered on the Spark RDD API]

model = LogisticRegressionWithSGD.train(trainRDD)
dataRDD.map(point => model.predict(point))
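The predict step above can be mimicked in plain Python; the weights here are hand-picked for illustration rather than trained with SGD as MLLib would do:

```python
import math

# Logistic-regression prediction: dot product through the logistic function,
# thresholded at 0.5. Toy weights and points, not a fitted model.
def predict(weights, point):
    margin = sum(w * x for w, x in zip(weights, point))
    prob = 1.0 / (1.0 + math.exp(-margin))
    return 1 if prob > 0.5 else 0

weights = [2.0, -1.0]
data = [[3.0, 1.0], [0.5, 4.0]]
print([predict(weights, p) for p in data])  # [1, 0]
```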

Page 25:

Spark Platform: Streaming

[Diagram: Streaming (on RDD-based streams), MLLib, and GraphX layered on the Spark RDD API]

dstream = spark.networkInputStream()
dstream.countByWindow(Seconds(30))
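countByWindow's semantics can be sketched in plain Python over toy event timestamps (this is not the Spark Streaming API):

```python
# Count the events whose timestamp falls inside the last 30 seconds.
def count_by_window(event_times, now, window_seconds=30):
    return sum(1 for t in event_times if now - window_seconds < t <= now)

events = [1, 5, 29, 31, 58, 60]  # toy event times, in seconds
print(count_by_window(events, now=60))  # events at 31, 58, 60 fall in (30, 60]
```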

Page 26:

Spark Platform: SQL

[Diagram: SQL (on Schema RDDs), Streaming, MLLib, and GraphX layered on the Spark RDD API]

rdd = sql"select * from rdd1 where age > 10"
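The SQL snippet amounts to a per-record predicate; a plain Python sketch over made-up rows:

```python
# "select * from rdd1 where age > 10" as a plain filter over records.
rdd1 = [
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 9},
    {"name": "eve", "age": 12},
]
result = [row for row in rdd1 if row["age"] > 10]
print([row["name"] for row in result])  # ['ada', 'eve']
```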

Page 27:

Performance

[Charts: SQL[1], response time in seconds for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); Streaming[2], throughput in MB/s/node for Storm vs. Spark; Graph[3], response time in minutes for Hadoop, Giraph, GraphX]

[1] https://amplab.cs.berkeley.edu/benchmark/
[2] Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
[3] https://amplab.cs.berkeley.edu/publication/graphx-grades/

Page 28:

Composition

Most large-scale data programs involve parsing, filtering, and cleaning data.

Spark allows users to compose patterns elegantly. E.g.:

Select input dataset with SQL, then run machine learning on the result.
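The pattern above, in miniature and in plain Python (toy data; a trivial aggregate stands in for the machine-learning step):

```python
# Composition sketch: filter with a SQL-style predicate, then feed the result
# to a "model". Illustrative only, not Spark code.
rows = [(0.5, 1), (2.0, 4), (3.0, 9), (-1.0, 1)]

selected = [(x, y) for x, y in rows if x > 0]          # the "SQL" step
mean_y = sum(y for _, y in selected) / len(selected)   # a trivial "ML" step
print(mean_y)
```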

Page 29:

Spark Community

One of the largest open source projects in big data

150+ developers contributing

30+ companies contributing

[Chart: contributors in the past year, scale 0-150]

Page 30:

Community Growth

Spark 0.6 (Oct ‘12): 17 contributors

Spark 0.7 (Feb ‘13): 31 contributors

Spark 0.8 (Sept ‘13): 67 contributors

Spark 0.9 (Feb ‘14): 83 contributors

Page 31:

Databricks

Primary developers of Spark today

Founded by Spark project creators

A nexus of several research areas:

→ OS and computer architecture

→ Networking and distributed systems

→ Statistics and machine learning

Page 32:

Today’s Speakers

Holden Karau – Worked on large-scale storage infrastructure @ Google.

Intro to Spark and Scala

Hossein Falaki – Data scientist at Apple.

Numerical Programming with Spark

Xiangrui Meng – PhD from Stanford ICME. Lead contributor on MLLib.

Deep dive on Spark’s MLLib

Page 33:

Questions?