Distributed Computing with Spark and MapReduce
Reza Zadeh · @Reza_Zadeh | http://reza-zadeh.com

Transcript of Distributed Computing with Spark and MapReduce

Page 1:

Reza Zadeh

Distributed Computing with Spark and MapReduce

@Reza_Zadeh | http://reza-zadeh.com

Page 2:

Traditional Network Programming

Message-passing between nodes (e.g. MPI)

Very difficult to do at scale:
» How to split problem across nodes?
  • Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (node not failed, but slow)
» Ethernet networking not fast
» Have to write programs for each machine

Rarely used in commodity datacenters

Page 3:

Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Run parts twice for fault recovery

Biggest example: MapReduce
[Diagram: several Map tasks feeding into Reduce tasks]

Page 4:

MapReduce + GFS
Most of early Google infrastructure, tremendously successful

Replicate disk content 3 times, sometimes 8

Rewrite algorithms for MapReduce

Page 5:

Diagram of typical cluster: http://insightdataengineering.com/blog/pipeline_map.html

Page 6:

Example MapReduce Algorithms
Matrix-vector multiplication (see the sketch below)
Power iteration (e.g. PageRank)
Gradient descent methods
Stochastic SVD
Tall-and-skinny QR

Many others!
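To make the map-and-reduce structure of the first algorithm concrete, here is a minimal sketch in Spark-style Scala (Spark itself is introduced later in these notes). The input path, the "row col value" triplet format, and broadcasting the small dense vector x are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: y = A * x, with sparse A stored one "row col value" triplet per line (assumed format)
// and x a small dense vector that fits in memory on every node (an assumption).
val sc = new SparkContext(new SparkConf().setAppName("matvec"))
val x  = sc.broadcast(Array(1.0, 2.0, 3.0))             // toy vector for illustration

val y = sc.textFile("A.txt")                            // hypothetical input file
  .map { line =>
    val Array(i, j, v) = line.split(" ")
    (i.toLong, v.toDouble * x.value(j.toInt))           // map: A(i,j) * x(j) contributes to row i
  }
  .reduceByKey(_ + _)                                   // reduce: sum the contributions per row

y.collect().foreach { case (row, value) => println(s"y($row) = $value") }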

Page 7:

Why Use a Data Flow Engine?
Ease of programming
» High-level functions instead of message passing
Wide deployment
» More common than MPI, especially “near” data
Scalability to very largest clusters
» Even HPC world is now concerned about resilience

Examples: Pig, Hive, Scalding

Page 8:

Limitations of MapReduce

Page 9:

Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to distributed file system
» Slow due to replication & disk storage

Page 10:

Example: Iterative Apps
[Diagram: an iterative job reads its input from the file system, writes state back to the file system after each iteration, and reads it again for the next one; interactive queries likewise each re-read the input from the file system to produce result 1, result 2, result 3, ...]
Commonly spend 90% of time doing I/O

Page 11:

Example: PageRank
Repeatedly multiply sparse matrix and vector

Requires repeatedly hashing together page adjacency lists and rank vector

[Diagram: Neighbors(id, edges) and Ranks(id, rank) are joined in iteration 1, iteration 2, iteration 3, ...; the same file is grouped over and over]
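For concreteness, the repeated join in the diagram looks roughly like the following Spark-style Scala sketch (Spark is introduced later in these notes); the file name, adjacency-list format, fixed 10 iterations, and 0.15/0.85 damping are assumptions. The point is the access pattern: the static link file is re-hashed and re-joined with the ranks on every iteration.

// Neighbors(id, edges) and Ranks(id, rank), joined once per iteration
val links = sc.textFile("links.txt")
  .map { line => val parts = line.split(" "); (parts.head, parts.tail.toSeq) }
var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap { case (edges, rank) =>
    edges.map(dest => (dest, rank / edges.size))        // spread rank evenly over out-edges
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)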

Page 12:

Result
While MapReduce is simple, it can require asymptotically more communication or I/O

Page 13:

Verdict
MapReduce algorithms research doesn’t go to waste, it just gets sped up and easier to use

Still useful to study as an algorithmic framework, silly to use directly

Page 14:

Spark computing engine

Page 15:

Spark Computing Engine
Extends a programming language with a distributed collection data structure
» “Resilient Distributed Datasets” (RDDs)

Open source at Apache
» Most active community in big data, with 50+ companies contributing

Clean APIs in Java, Scala, Python, R

Page 16:

Resilient Distributed Datasets (RDDs)

Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt")                 // RDD[String]

// Transform using standard collection operations (lazily evaluated)
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

messages.saveAsTextFile("errors.txt")              // kicks off a computation

Page 17:

Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, ...)
» The world only lets you make RDDs such that they can be: automatically rebuilt on failure
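As one illustration of the user-controlled storage mentioned above, here is a minimal sketch (the log file name and the choice of storage level are assumptions): the RDD is kept in memory, spilling to disk if it does not fit, so later actions reuse it instead of recomputing it.

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("log.txt").filter(_.startsWith("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)              // user-chosen storage: memory, spill to disk

println(errors.count())                                   // first action materializes and caches the RDD
println(errors.filter(_.contains("timeout")).count())     // later queries reuse the cached partitions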

Page 18:

Python, Java, Scala, R

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java (better in Java 8!):
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("error"); }
}).count();

Page 19:

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)   # pair is (type, count)

[Diagram: Input file → map → reduce → filter]

RDDs track lineage info to rebuild lost data
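A minimal sketch of inspecting that lineage (the file name and tab-separated format are assumptions): each RDD remembers the chain of transformations that produced it, which is what Spark replays to rebuild lost partitions.

val counts = sc.textFile("events.txt")
  .map(rec => (rec.split("\t")(0), 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count > 10 }

println(counts.toDebugString)    // prints the lineage: textFile -> map -> reduceByKey -> filter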


Page 21:

Partitioning

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)   # pair is (type, count)

[Diagram: Input file → map → reduce → filter; the reduceByKey output is known to be hash-partitioned, and the filter output's partitioning is also known]

RDDs know their partitioning functions
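A minimal sketch of making that partitioning explicit (the input file, key extraction, and the choice of 8 partitions are assumptions): once the pairs are hash-partitioned, per-key operations downstream can aggregate without another shuffle, and the result still knows its partitioner.

import org.apache.spark.HashPartitioner

val pairs = sc.textFile("events.txt")
  .map(line => (line.split("\t")(0), 1))
  .partitionBy(new HashPartitioner(8))       // now known to be hash-partitioned

val counts = pairs.reduceByKey(_ + _)        // aggregates within the existing partitions
println(counts.partitioner)                  // Some(HashPartitioner) -- partitioning is still known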

Page 22:

State of the Spark ecosystem

Page 23:

Spark Community

Most active open source community in big data

200+ developers, 50+ companies contributing

[Bar chart: contributors in the past year, comparing Spark with Giraph and Storm]

Page 24:

Project Activity

[Bar charts: commits and lines of code changed over the past 6 months, comparing MapReduce, YARN, HDFS, Storm, and Spark]

Page 25:

Continuing Growth

[Chart: contributors per month to Spark; source: ohloh.net]

Page 26:

Built-in libraries

Page 27:

Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this

[Diagram: language bindings (Python, Scala, Java, R) on top of libraries (SQL, ML, graph, ...) on top of Spark Core]

Much of future activity will be in these libraries

Page 28:

A General Platform

[Diagram: Spark Core, with standard libraries on top: Spark Streaming (real-time), Spark SQL (structured data), GraphX (graph), MLlib (machine learning)]

Standard libraries included with Spark

Page 29:

Machine Learning Library (MLlib)

40 contributors in past year

points = context.sql("select latitude, longitude from tweets")

model = KMeans.train(points, 10)
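The same idea as a minimal Scala sketch against the RDD-based MLlib API (the input path, column layout, and iteration count are assumptions): parse (latitude, longitude) pairs into vectors and cluster them into 10 groups.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("tweets_latlon.csv").map { line =>
  val Array(lat, lon) = line.split(",").map(_.toDouble)
  Vectors.dense(lat, lon)
}.cache()                                    // k-means makes several passes over the data

val model = KMeans.train(points, 10, 20)     // k = 10 clusters, at most 20 iterations
model.clusterCenters.foreach(println)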

Page 30:

MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS

Page 31:

GraphX

Page 32:

GraphX
General graph processing library
Build graph using RDDs of nodes and edges
Large library of graph algorithms with composable steps
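A minimal sketch of those three points together (the toy vertex and edge data are assumptions): build a graph from RDDs of nodes and edges, then compose one of the library algorithms, PageRank, on top of it.

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))          // RDD of nodes
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))  // RDD of edges

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.0001).vertices     // run PageRank to a convergence tolerance of 0.0001
ranks.join(vertices).collect().foreach(println)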

Page 33:

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
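A minimal sketch of this model (the socket source, host and port, and the 1-second batch interval are assumptions): the live stream is chopped into batches, and each batch is processed with ordinary RDD-style operations.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))          // batches of 1 second
val lines = ssc.socketTextStream("localhost", 9999)     // live data stream

val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                   // each batch processed with RDD operations
counts.print()                                          // processed results, returned batch by batch

ssc.start()
ssc.awaitTermination()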

Page 34:

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system

Page 35:

Spark SQL

// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
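The query above assumes a table called people has already been registered. A minimal sketch of how that might be done (the case class, file path, and SQLContext are assumptions), using the early SchemaRDD-era API that these slides also use elsewhere (registerAsTable was later renamed registerTempTable):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD            // implicitly turns an RDD of case classes into a table

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")             // after this, "SELECT ... FROM people" works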

Page 36:

Spark SQL
Enables loading & querying structured data in Spark

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi", "user": {"name": "matei", "id": 123}}

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
