
Programming Systems for Big Data

CS 315B, Lecture 15, Prof. Aiken
Including material from Kunle Olukotun

Big Data

• We’ve focused on parallel programming for computational science

• There is another class of programming systems focused on “Big Data”
  – MapReduce
  – Spark
  – TensorFlow


Warehouse Size Cluster


Example: Google Cluster


Commodity Cluster Architecture

[Figure: commodity cluster architecture. Each node has a CPU (8 cores), memory (64-256 GB), and disk (10-30 TB). A rack contains 16-64 nodes connected by a switch, with 1 Gbps between any pair of nodes in a rack and a 2-10 Gbps backbone between racks.]

Commodity Cluster Trends


Storing Big Data


Stable Storage

• If nodes can fail, how can we store data persistently?

• Answer: Distributed File System
  – Provides global file namespace
  – GFS, HDFS

• Note: Not HDF5!

• Typical usage pattern
  – Huge files (100s of GB to TB)
  – Data is rarely updated in place
  – Reads and appends are common (e.g. log files)


Distributed File System

• Chunk servers
  – a.k.a. DataNodes in HDFS
  – A file is split into contiguous chunks
  – Typically each chunk is 16-128 MB
  – Each chunk is replicated (usually 2x or 3x)
  – Try to keep replicas in different racks

• Master node
  – a.k.a. the NameNode in HDFS
  – Stores metadata
  – Might be replicated

• Client library for file access
  – Talks to the master to find the chunk (data) servers
  – Connects directly to chunk servers to access data

Hadoop Distributed File System (HDFS)

• Global namespace

• Files are broken into blocks
  – Typically 128 MB block size
  – Each block is replicated on multiple DataNodes

• Intelligent client
  – The client can find the location of blocks
  – The client accesses data directly from a DataNode (a rough client-side sketch follows)
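As a rough illustration of that client path, here is a sketch in Scala using the standard Hadoop FileSystem API; the NameNode address and file path below are made up for the example:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:9000")   // hypothetical NameNode address

val fs   = FileSystem.get(conf)
val path = new Path("/logs/app.log")               // hypothetical file

// Ask the NameNode (via the client library) where the blocks live ...
val status = fs.getFileStatus(path)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach(b => println(b.getHosts.mkString(", ")))

// ... then read the data; the stream fetches bytes from DataNodes directly.
val in = fs.open(path)
val firstByte = in.read()
in.close()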


MapReduce


The Programming Model

• A program consists of two functions
  – Map function f
  – Reduce function g

• In the map phase
  – The map function f is applied to every data “chunk”
  – Output is a set of <key, value> pairs

• In the reduce phase
  – The reduce function g is applied once to all values with the same key (see the sketch below)
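For concreteness, here is a minimal word-count sketch of this model in plain Scala, run locally on an in-memory list of "chunks" (the chunk contents are made up; this shows the shape of map, shuffle-by-key, and reduce, not an actual Hadoop job):

// Map phase: f is applied to every chunk and emits <key, value> pairs.
def mapF(chunk: String): Seq[(String, Int)] =
  chunk.split("\\s+").toSeq.map(word => (word, 1))

// Reduce phase: g is applied once to all values with the same key.
def reduceF(key: String, values: Seq[Int]): (String, Int) =
  (key, values.sum)

val chunks  = List("the quick brown fox", "the lazy dog", "the fox")
val pairs   = chunks.flatMap(mapF)      // map phase
val grouped = pairs.groupBy(_._1)       // shuffle: group values by key
val counts  = grouped.map { case (k, kvs) => reduceF(k, kvs.map(_._2)) }
// counts contains ("the", 3), ("fox", 2), ("quick", 1), ("brown", 1), ...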


Picture

[Figure: the MapReduce dataflow. Input chunks feed several Map tasks; their <key, value> outputs are routed to Reduce tasks, which produce the output.]

What is MapReduce?

• Dataflow language
  – A graph of
    • Nodes that are computation
    • Edges that carry data

• In particular, MapReduce graphs are acyclic
  – Like Legion, StarPU, ...

• And very restricted


MapReduce Provides

• Automatic parallelization & distribution

• Fault tolerance

• I/O scheduling

• Monitoring & status updates


MapReduce: Distributed Execution

[Figure: distributed execution. The user program forks a master and workers. The master assigns map tasks and reduce tasks to workers. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results locally; reduce workers do remote reads and sort, then write Output File 0 and Output File 1.]


Data Flow

• Input and final output are stored on a DFS
  – The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
    • Same node or same rack
  – Data locality of I/O is important
    • Bisection bandwidth of the network is low (~10 Gb/s)

• Intermediate results are stored on the local FS of map and reduce workers

• Output is often the input to another MapReduce task


Coordination: The Master

• Master data structures
  – Task status: (idle, in-progress, completed)
  – Idle tasks get scheduled as workers become available
  – When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  – Master pushes this info to reducers

• Master pings workers periodically to detect failures


Failures

• Map worker failure
  – Reduce workers are notified when the task is rescheduled on another worker

• Reduce worker failure
  – The reduce task is rescheduled

• Master failure
  – The MapReduce task is aborted and the client is notified


How many Map and Reduce jobs?

• M map tasks, R reduce tasks

• Rule of thumb:
  – Make M and R much larger than the number of CPUs in the cluster (8000 CPUs ⇒ M = 800,000 ⇒ 100 map tasks per CPU)
  – One DFS chunk per map is common (800,000 x 128 MB ≈ 102 TB)
  – Improves dynamic load balancing and speeds recovery from worker failure

• Usually R is smaller than M, because output is spread across R files


Partition Function

• Inputs to map tasks are created by contiguous splits of input file at chunk granularity

• For reduce, we need to ensure that records with the same intermediate key end up at the same worker

• The system uses a default partition function, e.g., hash(key) mod R

• Sometimes useful to override
  – E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file (see the sketch below)
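A minimal sketch of both partition functions in Scala (the function names are illustrative, not part of any MapReduce API):

// Default partitioner: hash(key) mod R picks one of R reducers.
// Math.floorMod keeps the result non-negative even when hashCode is negative.
def defaultPartition(key: String, numReducers: Int): Int =
  Math.floorMod(key.hashCode, numReducers)

// Custom partitioner: route by hostname, so every URL from the same host
// lands in the same reduce partition and therefore the same output file.
def hostPartition(url: String, numReducers: Int): Int =
  Math.floorMod(new java.net.URL(url).getHost.hashCode, numReducers)

// Example: these two URLs always go to the same partition.
// hostPartition("http://example.com/a", 10) == hostPartition("http://example.com/b", 10)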


Combiners

• Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k
  – E.g., popular words in Word Count

• Can save network time by pre-aggregating at the mapper
  – combine(k1, list(v1)) → v2
  – Usually the same as the reduce function

• Works only if the reduce function is commutative and associative (a small sketch follows)
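A small word-count combiner sketch in Scala (illustrative names, not Hadoop's Combiner interface):

// Pre-aggregate (word, 1) pairs on the mapper before they cross the network.
// This is safe here because the reduce function (integer addition) is
// commutative and associative, so partial sums can be merged in any order.
def combine(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
  pairs.groupBy(_._1)
       .map { case (word, kvs) => (word, kvs.map(_._2).sum) }
       .toSeq

// combine(Seq(("the", 1), ("the", 1), ("fox", 1)))
//   => Seq(("the", 2), ("fox", 1))   (order may vary)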


Execution Summary

• map(), reduce()
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition the space of output map keys, and run reduce() in parallel

• If map() or reduce() fails, re-execute!


MapReduce & Hadoop Conclusions

• MapReduce has proven to be a useful abstraction for huge-scale data parallelism
  – Greatly simplifies large-scale computations at Google, Yahoo, etc.

• Easy to use
  – The library deals with the messy details of task placement, data movement, and fault tolerance

• Not efficient or expressive enough for all problems
  – Requires huge data to be worthwhile


Spark


Spark Goals

• Extend MapReduce to better support two common classes of data analytics:
  – Iterative algorithms
    • machine learning, graphs
  – Interactive data mining


Scala

• Spark is integrated into the Scala programming language
  – Java dialect
  – With functional programming features

• Improves programmability over MapReduce implementations
  – Mostly because Scala is just a more modern programming language


Motivation

• MapReduce is inefficient for applications that repeatedly reuse data
  – Recall MapReduce programs are acyclic
  – The only way to encode an iterative algorithm is to wrap a MapReduce program in a loop
  – This implies data is reloaded from stable storage on each iteration


Programming Model

Resilient distributed datasets (RDDs)
  – Immutable, partitioned collections of objects
  – Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
  – Can be cached for efficient reuse

Actions on RDDs
  – count, reduce, collect, save, ...
  – Generate a result on the master


Transformations

// Load a text file from the local FS, HDFS, or S3
val rdd = spark.textFile("hdfs://namenode:0/path/file")

// Create an RDD from a Scala collection
val nums = spark.parallelize(List(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x*x)             // {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(x => x % 2 == 0)   // {4}

// Map each element to zero or more others
// (1 to x is the sequence of numbers 1, 2, ..., x)
nums.flatMap(x => 1 to x)                    // => {1, 1, 2, 1, 2, 3}

Actions

val nums = spark.parallelize(List(1, 2, 3))

// Retrieve RDD contents as a local collection
nums.collect()                  // => Array(1, 2, 3); could be too big!

// Return the first K elements
nums.take(2)                    // => Array(1, 2)

// Count the number of elements
nums.count()                    // => 3

// Merge elements with an associative function
nums.reduce((a, b) => a + b)    // => 6

// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

// Base RDD
val lines = spark.textFile("hdfs://...")

// Transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

// Actions
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to the workers; each worker reads its HDFS block (Block 1, 2, 3), keeps the filtered messages in an in-memory cache (Cache 1, 2, 3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)


RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:
val messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

[Lineage: HDFS File --filter(func = _.startsWith(...))--> Filtered RDD --map(func = _.split(...))--> Mapped RDD]

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: a scatter of + and - points, a random initial line, and the target separating line.]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)   // w is mutable, i.e. not functional

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce((a, b) => a + b)
  w -= gradient
}

println("Final w: " + w)

// The for loop and the gradient update run on the master;
// the map and reduce run on the cluster.

Logistic Regression Performance

127 s / iteration

First iteration: 174 s
Further iterations: 6 s

29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)


Spark Discussion

• Keep the benefits of MapReduce with a more traditional data-parallel functional programming model

• Higher performance by keeping intermediate data in memory instead of on disk
  – Memory has 10,000x better latency and 100x better bandwidth than disk

• Fault tolerance comes from the functional programming model
  – The model breaks when you have non-functional code (use of vars)


Spark Discussion

• Data partitioning is built in for MapReduce and Spark

• Initial partitioning is just chunking data sets

• The limited set of operations on partitioned data simplifies communication and placement
  – Map, reduce, ...


TensorFlow


TensorFlow

• Another dataflow model

• Focused on machine learning applications
  – More on this shortly

• The basic data type is a tensor
  – A multidimensional array


TensorFlow Example

The Dataflow Graph

Why TensorFlow?

• The dataflow model makes tasks explicit
  – Units of scheduling

• One major motivation for TensorFlow is to make programming GPUs and clusters easier
  – Tasks can have variants
  – Tasks can be assigned to GPUs or CPUs
    • If an appropriate variant is available
  – Supports single-node and multi-node execution
  – The implementation has a built-in mapping heuristic


Data and Communication

• Once tasks are assigned, it is clear where data communication is required
  – E.g., if the source task is on the CPU and the destination task is on the GPU

• The implementation automatically inserts copy operations to move data to where it is needed
  – Not clear if multiple alternatives are considered
  – E.g., zero-copy vs. frame-buffer memory on the GPU


Sessions

• Typically the same graph is reused many times

• A session
  – Sets up a TensorFlow graph
  – Provides hooks to call the graph with different inputs/outputs

• There are also options to call only a portion of the graph
  – E.g., a particular subgraph


Automatic Differentiation

• Many ML algorithms are essentially optimization algorithms and need to compute gradients

• TensorFlow has built-in support for computing the gradient function of a TensorFlow graph
  – Each primitive function has a gradient function
  – Primitive gradients are composed using the chain rule (a small worked example follows)
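As a small worked example of that composition (generic calculus, not TensorFlow-specific notation): consider a graph that computes $y = \sigma(w \cdot x)$ with the sigmoid $\sigma(z) = 1/(1 + e^{-z})$. The dot-product node and the sigmoid node each contribute only their local gradients, and the chain rule multiplies them along the path from the output back to $w$:

\[
\frac{\partial y}{\partial w}
  = \sigma'(w \cdot x)\,\frac{\partial (w \cdot x)}{\partial w}
  = \sigma(w \cdot x)\bigl(1 - \sigma(w \cdot x)\bigr)\,x .
\]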


Automatic Differentiation Example


Other Features

• Some tensors can be updated in place
  – Leads to a need for special “control flow” edges
  – These simply enforce an ordering of side effects on stateful tensors
  – Note the lack of sequential semantics

• Control flow constructs
  – Loops, if-then-else
  – But note that automatic differentiation doesn’t work for if-then-else


Other Features

• Queues
  – Programmers can add “queues” to dataflow edges to batch up work
  – And to allow different parts of the graph to execute asynchronously

• Note that execution is otherwise synchronous ...


Data Partitioning

• Interestingly, TensorFlow has no data partitioning primitives!
  – Not really a “big data” programming model
  – At least none that are exposed to the users
  – The underlying linear algebra packages (BLAS) may chunk up arrays

• The task parallelism in the dataflow graph, and replication of the graph for multiple-input scenarios, are the primary sources of parallelism

Summary

• Big Data problems are inspiring their own class of programming models

• Different constraints
  – More data, less complex compute

• But also more focus on programmer productivity
  – No assumption of willingness to learn a lot about parallel programming