Spark UCSD 16Nov15 Zonca
8/20/2019 Spark UCSD 16Nov15 Zonca
Distributed computing with Spark and Python
A. Zonca - San Diego Supercomputer Center
-
What is Spark?
● A distributed computing framework
-
Connect to a Spark cluster
Running on 4 instances of SDSC Cloud
http://bit.ly/sparknotebook
password: acluster
ALWAYS COPY WITH YOUR NAME
-
House price example
see house_price.ipynb
-
[diagram: map-reduce]
-
What's the point?
● write a simple mapper and reducer
● the framework scales to thousands of machines
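As a plain-Python sketch (no cluster required), the classic word-count mapper and reducer look like this; the names are illustrative, and the framework's job is to run `mapper` and `reducer` across many machines:

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Sum the counts emitted for each word.
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

lines = ["spark is fast", "spark is distributed"]
pairs = [p for line in lines for p in mapper(line)]
print(reducer(pairs))  # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```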
-
Problem 1: Storage
● Big data
● Commodity hardware (Cloud)
Solution: Distributed File System
● redundant
● fault tolerant
-
Problem 2: Computation
● Slow to move data across the network
● Computations fail
Solution: Hadoop MapReduce / Spark
● Execute computations where the data are located
● Rerun failed jobs
-
Problem 3: Communication
● Most of the time, data need to be summarized
to get a result
● Reduction phase in MapReduce
● Needs data transfer across the network
Solution: highly optimized Shuffle (All-to-All)
-
Spark and Hadoop
● Works within the Hadoop ecosystem
● Extends MapReduce
● Initially developed at UC Berkeley
● Now within the Apache Foundation
● ~400 developers
-
Key features of Spark
● Resiliency: tolerant to node failures
● Speed: supports in-memory caching
● Ease of use:
○ Python/Scala interfaces
○ interactive shells
○ many distributed primitives
-
Spark 100TB benchmark
-
Worker Node
[diagram: a Worker Node runs a Spark Executor inside a Java Virtual Machine, with Python worker processes alongside, and HDFS holding the local data]
-
[diagram: multiple Worker Nodes, each running one or more Executors (JVMs)]
-
Cluster Manager
[diagram: the Cluster Manager (YARN or Standalone) provisions and restarts the Worker Nodes and their Executors (JVMs)]
-
Driver Program
[diagram: the Driver Program hosts the SparkContext, which talks to the Cluster Manager to launch Executors (JVMs) on the Worker Nodes]
-
Spark Local
[diagram: in local mode the Driver Program, its SparkContext, and a single Executor (JVM) all run on the same machine]
-
Standalone CM on Comet
[diagram: the Standalone Cluster Manager and the Driver Program run on a master node on Comet; Executors (JVMs) run on the computing nodes]
-
YARN on Amazon EMR
[diagram: YARN runs on a master node on Amazon EMR; Executors (JVMs) run on EC2 computing nodes]
-
Standalone on SDSC Cloud
[diagram: the Standalone Cluster Manager runs on a master node; Executors (JVMs) run on computing nodes of SDSC Cloud]
-
House price with HDFS
see house_price_hdfs.ipynb
-
Resilient Distributed Dataset (RDD)
● Resilient: fault tolerant, lineage is saved so lost
partitions can be recovered
● Distributed: partitions are automatically
distributed across nodes
● Created from: HDFS, S3, HBase, local
hierarchy of folders
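In PySpark an RDD is typically created with `sc.textFile(path)` or `sc.parallelize(collection, numSlices)`. A plain-Python sketch of what partitioning a collection means (the helper name is made up, and real Spark may choose slice boundaries differently):

```python
def parallelize(data, num_partitions):
    # Toy stand-in for sc.parallelize(data, numSlices): split a
    # collection into roughly equal contiguous partitions.
    data = list(data)
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

parts = parallelize(range(10), 4)
print(parts)  # 4 partitions that together cover 0..9
```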
-
map
map: apply a function to each element of the RDD
[diagram: map applied to RDD partitions]
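In PySpark this is `rdd.map(func)`. A plain-Python sketch of the semantics, with an RDD modeled as a list of partition lists (names and partition layout are illustrative):

```python
def rdd_map(partitions, func):
    # map is narrow: func runs independently inside each partition;
    # no data move between partitions.
    return [[func(x) for x in part] for part in partitions]

partitions = [[0, 1], [2, 3], [4, 5]]        # toy 3-partition RDD
print(rdd_map(partitions, lambda x: x * 10)) # [[0, 10], [20, 30], [40, 50]]
```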
-
Other transformations
• filter(func) - keep only elements where func is true
• sample(withReplacement, fraction, seed) -
get a random fraction of the data
• coalesce(numPartitions) - merge partitions
to reduce them to numPartitions
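A plain-Python sketch of these three transformations, again modeling an RDD as a list of partition lists (helper names are made up; the coalesce here is a toy that just concatenates neighbouring partitions):

```python
import random

def rdd_filter(partitions, func):
    # filter is narrow: each partition keeps only elements where func is true.
    return [[x for x in part if func(x)] for part in partitions]

def rdd_sample(partitions, fraction, seed=42):
    # sample without replacement: keep each element with probability `fraction`.
    rng = random.Random(seed)
    return [[x for x in part if rng.random() < fraction] for part in partitions]

def rdd_coalesce(partitions, num_partitions):
    # coalesce merges existing partitions down to num_partitions
    # without a full shuffle (toy version: concatenate neighbours).
    n = len(partitions)
    return [sum(partitions[i * n // num_partitions:(i + 1) * n // num_partitions], [])
            for i in range(num_partitions)]

parts = [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
print(rdd_filter(parts, lambda x: x % 2 == 0))  # [[0], [2], [4], [6, 8]]
print(rdd_coalesce(parts, 2))                   # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```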
-
filter
filter: keep only elements where func is true
[diagram: filter applied to RDD partitions]
-
coalesce
sc.parallelize(range(10), 4).glom().collect()
Out[]: [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]

sc.parallelize(range(10), 4).coalesce(2).glom().collect()
Out[]: [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
-
coalesce
coalesce: reduce the number of partitions
[diagram: coalesce merging RDD partitions]
-
Wide transformations
-
reduceByKey(func)
(K, V) pairs => (K, reduce V with func)
[diagram: map, shuffle, then reduce over (K, V) pairs]
-
Narrow vs Wide
[diagram: map is a narrow transformation; reduceByKey is wide]
-
Wide transformations
• groupByKey : (K, V) pairs => (K, iterable of
all V)
• reduceByKey(func) : (K, V) pairs => (K,
reduction by func on all V)
• repartition(numPartitions) : similar to coalesce,
but shuffles all data to increase or decrease the
number of partitions to numPartitions
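The (K, V) semantics of the first two can be sketched in plain Python (dictionary-based stand-ins, not Spark's actual implementation):

```python
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    # groupByKey: (K, V) pairs => (K, list of all V) — every pair is shuffled.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, func):
    # reduceByKey: (K, V) pairs => (K, reduction of all V by func).
    return {k: reduce(func, vs) for k, vs in group_by_key(pairs).items()}

pairs = [("A", 1), ("B", 5), ("A", 2), ("B", 8)]
print(group_by_key(pairs))                       # {'A': [1, 2], 'B': [5, 8]}
print(reduce_by_key(pairs, lambda a, b: a + b))  # {'A': 3, 'B': 13}
```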
-
Shuffle
-
Shuffle
• Global redistribution of data
• High impact on performance
-
Shuffle
[diagram: mappers write (K, V) pairs to disk, e.g. (A, 1), (B, 5) and (A, 2), (B, 8); the data are redistributed across the network; reducers read (A, [1, 2]) and (B, [5, 8])]
-
Know shuffle, avoid it
• Which operations cause it?
• Is it necessary?
-
Do you really need groupByKey?
groupByKey: (K, V) pairs => (K, iterable of all V)
if you plan to call reduce later in the pipeline,
use reduceByKey instead.
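A toy two-partition example of why: reduceByKey pre-combines values per key inside each partition, so fewer records cross the network. The record counting below is schematic, not Spark's real accounting:

```python
def shuffle_size_group_by_key(partitions):
    # groupByKey ships every (K, V) record across the network.
    return sum(len(part) for part in partitions)

def shuffle_size_reduce_by_key(partitions, func):
    # reduceByKey first combines values per key within each partition,
    # so at most one record per key per partition is shuffled.
    shuffled = 0
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        shuffled += len(local)
    return shuffled

parts = [[("A", 1), ("A", 2), ("B", 5)],
         [("A", 8), ("B", 3), ("B", 4)]]
print(shuffle_size_group_by_key(parts))                       # 6 records shuffled
print(shuffle_size_reduce_by_key(parts, lambda a, b: a + b))  # 4 records shuffled
```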
-
groupByKey + reduce
[diagram: groupByKey shuffles every pair, e.g. (A, 1) and (A, 2), (A, 8) => (A, [1, 2, 8]); reduce then gives (A, 11)]
-
reduceByKey
[diagram: values are combined locally before the shuffle, e.g. (A, 2), (A, 8) => (A, 10); only (A, 1) and (A, 10) are shuffled, then reduced to (A, 11)]
-
Extract data from RDD
● collect() - copy all elements to the driver
● take(n) - copy the first n elements
● saveAsTextFile(filename) - save to file
● reduce(func) - aggregate elements with func
(takes 2 elements, returns 1)
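reduce(func) needs func to be associative, because each partition is reduced independently and the partial results are merged on the driver. A plain-Python sketch of that two-level reduction:

```python
from functools import reduce

def rdd_reduce(partitions, func):
    # Each executor reduces its own partition; the driver then
    # reduces the per-partition partial results.
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

partitions = [[1, 2, 3], [4, 5], [6]]              # toy 3-partition RDD
print(rdd_reduce(partitions, lambda a, b: a + b))  # 21
```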
-
Cached RDD
● Generally recommended after data cleaning
● Reusing cached data: 10x speedup
● Great for iterative algorithms
● If the RDD is too large, it will only be partially kept
in memory
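In PySpark caching is `rdd.cache()` (or `persist()`). A toy plain-Python illustration of what it buys: without caching, every action re-runs the whole lineage; with it, the computed data are reused (the counter and function names are made up):

```python
calls = {"n": 0}

def expensive_clean(x):
    # Stand-in for an expensive transformation in the RDD lineage.
    calls["n"] += 1
    return x * 2

data = list(range(5))

# Without caching, every action (here, two sums) re-runs the lineage:
total1 = sum(expensive_clean(x) for x in data)
total2 = sum(expensive_clean(x) for x in data)
uncached_calls = calls["n"]           # 10: the lineage ran twice

# With caching (like rdd.cache()), the lineage runs once:
calls["n"] = 0
cached = [expensive_clean(x) for x in data]
total1, total2 = sum(cached), sum(cached)
cached_calls = calls["n"]             # 5: computed once, reused
print(uncached_calls, cached_calls)
```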
-
Directed Acyclic Graph Scheduler
-
Directed Acyclic Graph
-
Directed Acyclic Graph
Track dependencies!
(also known as lineage or
provenance)
-
DAG in Spark
• nodes are RDDs
• arrows are Transformations
-
Narrow vs Wide
[diagram: map is narrow; groupByKey is wide]
-
Narrow vs Wide
[diagram: map transforms (K, V) pairs within partitions; groupByKey shuffles them across partitions]
-
Spark DAG of transformations
[diagram: textFile reads strings from HDFS; map produces (k, v) pairs; reduceByKey aggregates them]
-
[diagram: two inputs read from HDFS with textFile; map produces (k, v1) and (k, v2) pairs; join yields (k, (v1, v2)); a final map computes (k, v1 * v2)]
-
Thank you