Spark UCSD 16Nov15 Zonca
8/20/2019 Spark UCSD 16Nov15 Zonca
Distributed computing with Spark and Python
A. Zonca - San Diego Supercomputer Center
-
What is Spark?
● A distributed computing framework
-
Connect to a Spark cluster
Running on 4 instances of SDSC Cloud
http://bit.ly/sparknotebook
password: acluster
ALWAYS COPY WITH YOUR NAME
-
House price example
see house_price.ipynb
-
[diagram: map-reduce]
-
What's the point?
● write a simple mapper and reducer
● the framework scales to thousands of machines
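As a plain-Python sketch (no cluster required), the classic word-count mapper and reducer look like this; the names are illustrative, and the framework's job is to run `mapper` and `reducer` across many machines:

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Sum the counts emitted for each word.
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

lines = ["spark is fast", "spark is distributed"]
pairs = [p for line in lines for p in mapper(line)]
print(reducer(pairs))  # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```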
-
Problem 1: Storage
● Big data
● Commodity hardware (Cloud)
Solution: Distributed File System
● redundant
● fault tolerant
-
Problem 2: Computation
● Slow to move data across the network
● Computations fail
Solution: Hadoop MapReduce / Spark
● Execute computations where the data are located
● Rerun failed jobs
-
Problem 3: Communication
● Most of the time, data need to be summarized
to get a result
● Reduction phase in MapReduce
● Needs data transfer across the network
Solution: highly optimized Shuffle (All-to-All)
-
Spark and Hadoop
● Works within the Hadoop ecosystem
● Extends MapReduce
● Initially developed at UC Berkeley
● Now within the Apache Foundation
● ~400 developers
-
Key features of Spark
● Resiliency: tolerant to node failures
● Speed: supports in-memory caching
● Ease of use:
○ Python/Scala interfaces
○ interactive shells
○ many distributed primitives
-
Spark 100TB benchmark
-
Worker Node
[diagram: a Worker Node runs a Spark Executor inside a Java Virtual Machine, with Python worker processes alongside, and HDFS holding the local data]
-
[diagram: multiple Worker Nodes, each running one or more Executors (JVMs)]
-
Cluster Manager
[diagram: the Cluster Manager (YARN or Standalone) provisions and restarts the Worker Nodes and their Executors (JVMs)]
-
Driver Program
[diagram: the Driver Program hosts the SparkContext, which talks to the Cluster Manager to launch Executors (JVMs) on the Worker Nodes]
-
Spark Local
[diagram: in local mode the Driver Program, its SparkContext, and a single Executor (JVM) all run on the same machine]
-
Standalone CM on Comet
[diagram: the Standalone Cluster Manager and the Driver Program run on a master node on Comet; Executors (JVMs) run on the computing nodes]
-
YARN on Amazon EMR
[diagram: YARN runs on a master node on Amazon EMR; Executors (JVMs) run on EC2 computing nodes]
-
Standalone on SDSC Cloud
[diagram: the Standalone Cluster Manager runs on a master node; Executors (JVMs) run on computing nodes of SDSC Cloud]
-
House price with HDFS
see house_price_hdfs.ipynb
-
Resilient Distributed Dataset (RDD)
● Resilient: fault tolerant, lineage is saved so lost
partitions can be recovered
● Distributed: partitions are automatically
distributed across nodes
● Created from: HDFS, S3, HBase, local
hierarchy of folders
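In PySpark an RDD is typically created with `sc.textFile(path)` or `sc.parallelize(collection, numSlices)`. A plain-Python sketch of what partitioning a collection means (the helper name is made up, and real Spark may choose slice boundaries differently):

```python
def parallelize(data, num_partitions):
    # Toy stand-in for sc.parallelize(data, numSlices): split a
    # collection into roughly equal contiguous partitions.
    data = list(data)
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

parts = parallelize(range(10), 4)
print(parts)  # 4 partitions that together cover 0..9
```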
-
map
map: apply a function to each element of the RDD
[diagram: map applied to RDD partitions]
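In PySpark this is `rdd.map(func)`. A plain-Python sketch of the semantics, with an RDD modeled as a list of partition lists (names and partition layout are illustrative):

```python
def rdd_map(partitions, func):
    # map is narrow: func runs independently inside each partition;
    # no data move between partitions.
    return [[func(x) for x in part] for part in partitions]

partitions = [[0, 1], [2, 3], [4, 5]]        # toy 3-partition RDD
print(rdd_map(partitions, lambda x: x * 10)) # [[0, 10], [20, 30], [40, 50]]
```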
-
Other transformations
• filter(func) - keep only elements where func is true
• sample(withReplacement, fraction, seed) -
get a random fraction of the data
• coalesce(numPartitions) - merge partitions
to reduce them to numPartitions
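A plain-Python sketch of these three transformations, again modeling an RDD as a list of partition lists (helper names are made up; the coalesce here is a toy that just concatenates neighbouring partitions):

```python
import random

def rdd_filter(partitions, func):
    # filter is narrow: each partition keeps only elements where func is true.
    return [[x for x in part if func(x)] for part in partitions]

def rdd_sample(partitions, fraction, seed=42):
    # sample without replacement: keep each element with probability `fraction`.
    rng = random.Random(seed)
    return [[x for x in part if rng.random() < fraction] for part in partitions]

def rdd_coalesce(partitions, num_partitions):
    # coalesce merges existing partitions down to num_partitions
    # without a full shuffle (toy version: concatenate neighbours).
    n = len(partitions)
    return [sum(partitions[i * n // num_partitions:(i + 1) * n // num_partitions], [])
            for i in range(num_partitions)]

parts = [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
print(rdd_filter(parts, lambda x: x % 2 == 0))  # [[0], [2], [4], [6, 8]]
print(rdd_coalesce(parts, 2))                   # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```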
-
filter
filter: keep only elements where func is true
[diagram: filter applied to RDD partitions]
-
coalesce
sc.parallelize(range(10), 4).glom().collect()
Out[]: [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]

sc.parallelize(range(10), 4).coalesce(2).glom().collect()
Out[]: [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
-
coalesce
coalesce: reduce the number of partitions
[diagram: coalesce merging RDD partitions]
-
Wide transformations
-
reduceByKey(func)
(K, V) pairs => (K, reduce V with func)
[diagram: map, shuffle, then reduce over (K, V) pairs]
-
Narrow vs Wide
[diagram: map is a narrow transformation; reduceByKey is wide]
-
Wide transformations
• groupByKey : (K, V) pairs => (K, iterable of
all V)
• reduceByKey(func) : (K, V) pairs => (K,
reduction by func on all V)
• repartition(numPartitions) : similar to coalesce,
but shuffles all data to increase or decrease the
number of partitions to numPartitions
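The (K, V) semantics of the first two can be sketched in plain Python (dictionary-based stand-ins, not Spark's actual implementation):

```python
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    # groupByKey: (K, V) pairs => (K, list of all V) — every pair is shuffled.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, func):
    # reduceByKey: (K, V) pairs => (K, reduction of all V by func).
    return {k: reduce(func, vs) for k, vs in group_by_key(pairs).items()}

pairs = [("A", 1), ("B", 5), ("A", 2), ("B", 8)]
print(group_by_key(pairs))                       # {'A': [1, 2], 'B': [5, 8]}
print(reduce_by_key(pairs, lambda a, b: a + b))  # {'A': 3, 'B': 13}
```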
-
Shuffle
-
Shuffle
• Global redistribution of data
• High impact on performance
-
Shuffle
[diagram: mappers write (K, V) pairs to disk, e.g. (A, 1), (B, 5) and (A, 2), (B, 8); the data are redistributed across the network; reducers read (A, [1, 2]) and (B, [5, 8])]
-
Know shuffle, avoid it
• Which operations cause it?
• Is it necessary?
-
Do you really need groupByKey?
groupByKey: (K, V) pairs => (K, iterable of all V)
if you plan to call reduce later in the pipeline,
use reduceByKey instead.
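A toy two-partition example of why: reduceByKey pre-combines values per key inside each partition, so fewer records cross the network. The record counting below is schematic, not Spark's real accounting:

```python
def shuffle_size_group_by_key(partitions):
    # groupByKey ships every (K, V) record across the network.
    return sum(len(part) for part in partitions)

def shuffle_size_reduce_by_key(partitions, func):
    # reduceByKey first combines values per key within each partition,
    # so at most one record per key per partition is shuffled.
    shuffled = 0
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        shuffled += len(local)
    return shuffled

parts = [[("A", 1), ("A", 2), ("B", 5)],
         [("A", 8), ("B", 3), ("B", 4)]]
print(shuffle_size_group_by_key(parts))                       # 6 records shuffled
print(shuffle_size_reduce_by_key(parts, lambda a, b: a + b))  # 4 records shuffled
```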
-
groupByKey + reduce
[diagram: groupByKey shuffles every pair, e.g. (A, 1) and (A, 2), (A, 8) => (A, [1, 2, 8]); reduce then gives (A, 11)]
-
reduceByKey
[diagram: values are combined locally before the shuffle, e.g. (A, 2), (A, 8) => (A, 10); only (A, 1) and (A, 10) are shuffled, then reduced to (A, 11)]
-
Extract data from RDD
● collect() - copy all elements to the driver
● take(n) - copy the first n elements
● saveAsTextFile(filename) - save to file
● reduce(func) - aggregate elements with func
(takes 2 elements, returns 1)
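reduce(func) needs func to be associative, because each partition is reduced independently and the partial results are merged on the driver. A plain-Python sketch of that two-level reduction:

```python
from functools import reduce

def rdd_reduce(partitions, func):
    # Each executor reduces its own partition; the driver then
    # reduces the per-partition partial results.
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

partitions = [[1, 2, 3], [4, 5], [6]]              # toy 3-partition RDD
print(rdd_reduce(partitions, lambda a, b: a + b))  # 21
```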
-
Cached RDD
● Generally recommended after data cleaning
● Reusing cached data: 10x speedup
● Great for iterative algorithms
● If the RDD is too large, it will only be partially kept
in memory
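In PySpark caching is `rdd.cache()` (or `persist()`). A toy plain-Python illustration of what it buys: without caching, every action re-runs the whole lineage; with it, the computed data are reused (the counter and function names are made up):

```python
calls = {"n": 0}

def expensive_clean(x):
    # Stand-in for an expensive transformation in the RDD lineage.
    calls["n"] += 1
    return x * 2

data = list(range(5))

# Without caching, every action (here, two sums) re-runs the lineage:
total1 = sum(expensive_clean(x) for x in data)
total2 = sum(expensive_clean(x) for x in data)
uncached_calls = calls["n"]           # 10: the lineage ran twice

# With caching (like rdd.cache()), the lineage runs once:
calls["n"] = 0
cached = [expensive_clean(x) for x in data]
total1, total2 = sum(cached), sum(cached)
cached_calls = calls["n"]             # 5: computed once, reused
print(uncached_calls, cached_calls)
```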
-
Directed Acyclic Graph Scheduler
-
Directed Acyclic Graph
-
Directed Acyclic Graph
Track dependencies!
(also known as lineage or
provenance)
-
DAG in Spark
• nodes are RDDs
• arrows are Transformations
-
Narrow vs Wide
[diagram: map is narrow; groupByKey is wide]
-
Narrow vs Wide
[diagram: map transforms (K, V) pairs within partitions; groupByKey shuffles them across partitions]
-
Spark DAG of transformations
[diagram: textFile reads strings from HDFS; map produces (k, v) pairs; reduceByKey aggregates them]
-
[diagram: two inputs read from HDFS with textFile; map produces (k, v1) and (k, v2) pairs; join yields (k, (v1, v2)); a final map computes (k, v1 * v2)]
-
Thank you