Apache Spark - Fast and Expressive Big Data Analytics with Python
Matei Zaharia
Fast and Expressive Big Data Analytics with Python
UC BERKELEY
spark-project.org UC Berkeley / MIT
What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
Project History
Started in 2009, open sourced 2010
17 companies now contributing code » Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …
Entered Apache incubator in June
Python API added in February
An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)
[Diagram: projects layered on top of Spark]
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
» …
More details: amplab.berkeley.edu
This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out
Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
Programmability: tangle of map/reduce functions
Speed: MapReduce inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries
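Data Sharing in MapReduce
[Diagram: an iterative job (iter. 1, iter. 2, . . .) reads its input from HDFS and writes back to HDFS between iterations; interactive queries (query 1, query 2, query 3, . . .) each read the input from HDFS again to produce result 1, result 2, result 3]
Slow due to data replication and disk I/O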
What We’d Like
[Diagram: the input goes through one-time processing into distributed memory, which is then shared by the iterations (iter. 1, iter. 2, . . .) and by the queries (query 1, query 2, query 3, . . .)]
10-100× faster than network and disk
Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1, Block 2, Block 3) and holds one cache (Cache 1, Cache 2, Cache 3)]
Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data)
Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
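The pattern above (filter and parse once, keep the result in memory, then run several queries against it) can be mimicked on a local list. The sketch below is only an illustration in plain Python: `log_lines` is made-up sample data and the list comprehensions stand in for the RDD operations; nothing here calls the Spark API.

```python
# Plain-Python stand-in for the log-mining pipeline (no Spark involved).
# log_lines is hypothetical sample data: tab-separated level, time, message.
log_lines = [
    "ERROR\t10:01\tfoo failed to start",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar timed out",
    "ERROR\t10:04\tfoo and bar both crashed",
]

# filter + map, evaluated once and kept in memory (the role of cache()).
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]

# Repeated interactive queries reuse the in-memory list instead of
# re-reading and re-parsing the log each time.
foo_count = sum(1 for s in messages if "foo" in s)
bar_count = sum(1 for s in messages if "bar" in s)

print(foo_count, bar_count)
```

On a cluster the same idea applies per partition: each worker keeps its share of `messages` in memory and answers later queries from there.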
Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = textFile(...).filter(lambda s: "ERROR" in s) \
                        .map(lambda s: s.split("\t")[2])

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)]
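To make the lineage idea concrete, here is a toy model in plain Python. `ToyRDD` and its methods are hypothetical names invented for this sketch, and it is not how Spark implements RDDs; it only shows that a dataset which remembers its parent and its transformation can be rebuilt after its materialized data is lost.

```python
# Toy model of lineage-based recovery. ToyRDD is a hypothetical class for
# illustration only; it is not how Spark implements RDDs.
class ToyRDD:
    def __init__(self, source=None, parent=None, func=None, kind="source"):
        self.source = source      # base data (stands in for the HDFS file)
        self.parent = parent      # dataset this one was derived from
        self.func = func          # transformation applied to the parent
        self.kind = kind          # "source", "filter" or "map"
        self._cache = None        # materialized data, may be lost

    def filter(self, pred):
        return ToyRDD(parent=self, func=pred, kind="filter")

    def map(self, fn):
        return ToyRDD(parent=self, func=fn, kind="map")

    def compute(self):
        # Recompute from lineage: go back to the parent and reapply func.
        if self.kind == "source":
            return list(self.source)
        data = self.parent.collect()
        if self.kind == "filter":
            return [x for x in data if self.func(x)]
        return [self.func(x) for x in data]

    def collect(self):
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

    def lose_data(self):
        # Simulate a failure that drops the materialized data.
        self._cache = None


base = ToyRDD(source=["ERROR\ta\tdisk full", "INFO\tb\tok", "ERROR\tc\tnet down"])
messages = base.filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2])

first = messages.collect()
messages.lose_data()            # the materialized result is gone...
recovered = messages.collect()  # ...but the lineage lets us rebuild it
print(recovered)
```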
Example: Logistic Regression
Goal: find line separating two sets of points
[Diagram: scatter of + and – points, showing a random initial line and the target line that separates the two sets]
Example: Logistic Regression
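import numpy

# readPoint, D and iterations are assumed to be defined elsewhere
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    # Gradient of the logistic loss, summed over all points
    gradient = data.map(lambda p:
        (1 / (1 + numpy.exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)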
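The same map-then-reduce structure can be tried locally without a cluster. The sketch below is a plain-Python stand-in, not PySpark: `points` is made-up, linearly separable toy data with labels in {-1, +1}, the helper names are hypothetical, and the start vector is fixed rather than random so the run is repeatable. Each point contributes the gradient of the logistic loss log(1 + exp(-y w·x)), which is (1 / (1 + exp(-y w·x)) - 1) · y · x.

```python
import math
from functools import reduce

# Hypothetical toy data: (x, y) pairs with 2-D features and labels in {-1, +1}.
points = [
    ((1.0, 2.0), 1), ((2.0, 1.5), 1), ((1.5, 1.0), 1),
    ((-1.0, -2.0), -1), ((-2.0, -1.0), -1), ((-1.5, -0.5), -1),
]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def point_gradient(w, x, y):
    # Gradient of log(1 + exp(-y * w.x)) with respect to w.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

w = [0.0, 0.0]   # fixed start instead of a random one, for repeatability
for _ in range(20):
    # map: per-point gradients; reduce: sum them into one vector
    gradient = reduce(add, map(lambda p: point_gradient(w, p[0], p[1]), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

# Count the points that end up on the correct side of the line w.x = 0.
correct = sum(1 for x, y in points if y * dot(w, x) > 0)
print(w, correct)
```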
Logistic Regression Performance
[Chart: running time (s), 0 to 4000, vs. number of iterations (1, 5, 10, 20, 30), for Hadoop and PySpark]
Hadoop: 110 s / iteration
PySpark: first iteration 80 s, further iterations 5 s
Demo
Supported Operators
map
filter
groupBy
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
flatMap
take
first
partitionBy
pipe
distinct
save
...
Other Engine Features
General operator graphs (not just map-reduce)
Hash-based reduces (faster than Hadoop’s sort)
Controlled data partitioning to save communication
PageRank Performance
[Chart: iteration time (s)]
Hadoop: 171
Basic Spark: 72
Spark + Controlled Partitioning: 23
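One way to see why controlled partitioning can save communication: if two keyed datasets are partitioned with the same function, records with equal keys land in the same partition index, so a join can proceed partition by partition without moving data between partitions. The sketch below is a toy illustration in plain Python; `partition_of`, `hash_partition`, `local_join` and the sample data are hypothetical, and this is not Spark's actual partitioner.

```python
# Toy illustration of co-partitioning; not Spark's actual partitioner.
NUM_PARTITIONS = 3

def partition_of(key):
    # Deterministic stand-in for a hash partitioner on integer keys.
    return key % NUM_PARTITIONS

def hash_partition(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        parts[partition_of(key)].append((key, value))
    return parts

# Hypothetical PageRank-style data keyed by page id.
links = hash_partition([(1, [2, 3]), (2, [3]), (3, [1]), (4, [1, 2])])
ranks = hash_partition([(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)])

def local_join(left_part, right_part):
    # Join two partitions without looking at any other partition.
    right = dict(right_part)
    return [(k, (v, right[k])) for k, v in left_part if k in right]

joined = []
for i in range(NUM_PARTITIONS):
    joined.extend(local_join(links[i], ranks[i]))

print(sorted(joined))
```

Because both datasets use the same partitioning function, every key finds its match inside its own partition pair, which is the property a repeated join in an iterative job can exploit.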
Spark Community
1000+ meetup members
60+ contributors
17 companies contributing
Overview
Spark core is written in Scala
PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)
No changes to Python
[Diagram: components labeled Your app, Spark client and PySpark, plus two Spark workers, each with two Python child processes]
Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)
Object Marshaling
Uses pickle library for both communication and cached data
» Much cheaper than Python objects in RAM
Lambda marshaling library by PiCloud
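A small standard-library experiment illustrates both points. Data round-trips through `pickle` as a compact byte string, but the standard `pickle` module serializes functions by name, so an anonymous lambda cannot be pickled; that limitation is presumably why a separate lambda-marshaling library is needed. The record contents and the helper `try_pickle_lambda` are made up for this sketch.

```python
import pickle

# Round-trip a record through pickle, as a stand-in for shipping data
# between processes or storing it in serialized form.
record = {"word": "spark", "count": 3, "tags": ["fast", "expressive"]}
blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

def try_pickle_lambda():
    # The standard pickle module serializes functions by reference (name),
    # so an anonymous lambda cannot be pickled.
    try:
        pickle.dumps(lambda s: s.upper())
        return True
    except (pickle.PicklingError, AttributeError):
        return False

lambda_ok = try_pickle_lambda()
print(len(blob), restored == record, lambda_ok)
```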
Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning
[Diagram: a graph of datasets A to G connected by groupBy, map, union and join, divided into Stage 1, Stage 2 and Stage 3, with a legend marking cached data partitions]
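One reading of "automatically pipelines functions" is that element-wise steps within a stage are applied back to back, so each record flows through all of them without an intermediate dataset being built. Python generators give a local feel for this. The sketch below is a toy illustration with hypothetical function names, not Spark's scheduler; the `calls` list exists only to record the processing order.

```python
# Toy illustration of pipelining with generators; not Spark's scheduler.
calls = []   # records the order in which elements are processed

def parse(records):
    for r in records:
        calls.append(("parse", r))
        yield int(r)

def keep_even(numbers):
    for n in numbers:
        calls.append(("filter", n))
        if n % 2 == 0:
            yield n

def square(numbers):
    for n in numbers:
        calls.append(("square", n))
        yield n * n

# The three steps are chained lazily: each element flows through all of them
# before the next element is read, and no intermediate list is built.
result = list(square(keep_even(parse(["1", "2", "3", "4"]))))
print(result)
print(calls[:5])
```

The recorded order shows element "2" being parsed, filtered and squared before element "3" is even read.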
Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy
Input from local file system, NFS, HDFS, S3
» Only text files for now
Works in IPython, including notebook
Works in doctests – see our tests!
Getting Started
Visit spark-project.org for video tutorials, online exercises, docs
Easy to run in local mode (multicore), standalone clusters, or EC2
Training camp at Berkeley in August (free video): ampcamp.berkeley.edu
Getting Started
Easiest way to learn is the shell:

$ ./pyspark
>>> nums = sc.parallelize([1,2,3])  # make RDD from array
>>> nums.count()
3
>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]
Writing Standalone Jobs
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")
    lines = sc.textFile("in.txt")

    counts = lines.flatMap(lambda s: s.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    counts.saveAsTextFile("out.txt")
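For readers new to these operators, the following plain-Python sketch shows what the three steps of the job compute on a small input. It does not use PySpark: the `lines` list is a made-up stand-in for the contents of in.txt, and the dictionary loop only imitates what reduceByKey does with `x + y`.

```python
from collections import defaultdict

# Hypothetical stand-in for the contents of in.txt.
lines = ["to be or not to be", "that is the question"]

# flatMap: each line yields several words, flattened into one sequence.
words = [word for line in lines for word in line.split()]

# map: pair every word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of each key with x + y.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(sorted(counts.items()))
```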
Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!
My email: [email protected]
Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory]
Cache disabled: 68.8
25%: 58.1
50%: 40.7
75%: 29.7
Fully cached: 11.5
The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)
[Diagram: projects layered on top of Spark]
» Shark (SQL)
» Spark Streaming (real-time)
» GraphX (graph)
» MLbase (machine learning)
» …
More details: amplab.berkeley.edu
Performance Comparison
[Chart: SQL, response time (s): Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem)]
[Chart: Streaming, throughput (MB/s/node): Storm, Spark]
[Chart: Graph, response time (min): Hadoop, Giraph, GraphLab, GraphX]