Big Data Infrastructures & Technologies
Transcript of boncz/bads/03-Spark.pdf
www.cwi.nl/~boncz/bads
Big Data Infrastructures & Technologies: Hadoop Streaming Revisited
ENRON Mapper
ENRON Reducer
Optimization?
ENRON Final Results (Excerpt)
The Spark Framework
credits: Matei Zaharia & Xiangrui Meng
What is Spark?
• Fast and expressive cluster computing system interoperable with Apache Hadoop
• Improves efficiency through:
  – In-memory computing primitives
  – General computation graphs
• Improves usability through:
  – Rich APIs in Scala, Java, Python
  – Interactive shell
• Up to 100× faster (2-10× on disk); often 5× less code
credits: Matei Zaharia & Xiangrui Meng
The Spark Stack
• Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
  – Spark Streaming (real-time)
  – GraphX (graph)
  – Spark SQL
  – MLlib (machine learning)
  – …
More details: amplab.berkeley.edu
credits: Matei Zaharia & Xiangrui Meng
Why a New Programming Model?
• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
  – More complex, multi-pass analytics (e.g. ML, graph)
  – More interactive ad-hoc queries
  – More real-time stream processing
• All three need faster data sharing across parallel jobs
credits: Matei Zaharia & Xiangrui Meng
Data Sharing in MapReduce
[Diagram: in iterative jobs, each iteration reads its input from HDFS and writes its output back to HDFS; in interactive use, each of queries 1-3 re-reads the input from HDFS to produce its result]
Slow due to replication, serialization, and disk IO
credits: Matei Zaharia & Xiangrui Meng
Data Sharing in Spark
[Diagram: after one-time processing of the input, iterations and queries share data through distributed memory instead of HDFS]
~10× faster than network and disk
credits: Matei Zaharia & Xiangrui Meng
Spark Programming Model
• Key idea: resilient distributed datasets (RDDs)
  – Distributed collections of objects that can be cached in memory across the cluster
  – Manipulated through parallel operators
  – Automatically recomputed on failure
• Programming interface
  – Functional APIs in Scala, Java, Python
  – Interactive use from the Scala shell
credits: Matei Zaharia & Xiangrui Meng
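The RDD idea — a lazy collection defined by its lineage, recomputable on demand — can be sketched in plain Python. This is a toy single-machine model, not Spark's API; the class and names are ours:

```python
# Toy model of an RDD: a dataset defined by its lineage (parent + transformation),
# evaluated lazily and re-derivable from its source at any time.
class ToyRDD:
    def __init__(self, source=None, parent=None, op=None):
        self.source, self.parent, self.op = source, parent, op

    def map(self, f):
        return ToyRDD(parent=self, op=("map", f))

    def filter(self, p):
        return ToyRDD(parent=self, op=("filter", p))

    def collect(self):
        # Nothing was computed until now: walk the lineage and derive the data.
        if self.parent is None:
            return list(self.source)
        data = self.parent.collect()
        kind, f = self.op
        return [f(x) for x in data] if kind == "map" else [x for x in data if f(x)]

lines = ToyRDD(source=["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"])
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split("\t")[2])
print(messages.collect())  # ['full', 'down']
```

Because `collect()` re-derives everything from the source, a "lost" intermediate result can always be rebuilt — the essence of RDD fault tolerance.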
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

[Diagram: a Driver program coordinating three Workers]
credits: Matei Zaharia & Xiangrui Meng
Lambda Functions
Lambda function ← functional programming!
= implicit (anonymous) function definition

errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

Here the first lambda plays the role of a named predicate such as:

bool detect_error(string x) {
    return x.startswith("ERROR");
}
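In plain Python terms (a non-Spark illustration), the lambda and the named function are interchangeable:

```python
# A lambda is just an unnamed function: both forms below behave identically.
def detect_error(x):
    return x.startswith("ERROR")

logs = ["ERROR\tdisk\tfull", "INFO\tok"]
named = list(filter(detect_error, logs))
anonymous = list(filter(lambda x: x.startswith("ERROR"), logs))
assert named == anonymous == ["ERROR\tdisk\tfull"]
```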
credits: Matei Zaharia & Xiangrui Meng
Example: Log Mining (continued)
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # Action
messages.filter(lambda x: "bar" in x).count()
. . .

[Diagram: the Driver sends tasks to three Workers; each Worker reads one HDFS block (Block 1-3), keeps a cache (Cache 1-3), and returns results to the Driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
![Page 16: Big Data Infrastructures & Technologiesboncz/bads/03-Spark.pdf · boncz/bads Big Data Infrastructures & Technologies HadoopStreaming Revisit](https://reader033.fdocuments.net/reader033/viewer/2022042210/5eadca156f67a058d16baa7b/html5/thumbnails/16.jpg)
www.cwi.nl/~boncz/bads
Fault Tolerance
RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda kv: kv[1] > 10)    # keep types with count > 10

[Diagram: lineage graph — input file → map → reduce → filter]
credits: Matei Zaharia & Xiangrui Meng
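The same map → reduceByKey → filter pipeline, sketched with plain Python building blocks (toy records and a lower threshold suited to the toy data; not Spark's API):

```python
from collections import Counter

records = ["warn", "error", "warn", "info", "error", "warn"]

# map: emit (type, 1) per record
pairs = [(rec, 1) for rec in records]

# reduceByKey(lambda x, y: x + y): sum the ones per key
counts = Counter()
for key, value in pairs:
    counts[key] += value

# filter: keep only frequent types (threshold 1 here instead of 10)
frequent = {k: v for k, v in counts.items() if v > 1}
print(frequent)  # {'warn': 3, 'error': 2}
```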
Example: Logistic Regression
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
credits: Matei Zaharia & Xiangrui Meng
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) { return s.contains("ERROR"); }
}).count();
credits: Matei Zaharia & Xiangrui Meng
Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
credits: Matei Zaharia & Xiangrui Meng
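As a plain-Python sketch of what two of these operators compute per partition (toy semantics only — real Spark shuffles data across the cluster first):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(pairs, f):
    # reduceByKey: combine values per key with an associative function
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

def group_by_key(pairs):
    # groupByKey: collect ALL values per key (more memory-hungry than reduceByKey)
    s = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(s, key=itemgetter(0))]

print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 4), ('b', 6)]
print(group_by_key(pairs))                       # [('a', [1, 3]), ('b', [2, 4])]
```

The contrast explains a standard Spark tip: prefer reduceByKey, which pre-aggregates, over groupByKey, which materializes every value.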
Software Components
• Spark client is a library in the user program (1 instance per app)
• Runs tasks locally or on a cluster
  – Mesos, YARN, standalone mode
• Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …
[Diagram: Your application holds a SparkContext, which runs local threads and talks to a cluster manager that launches Spark executors on Workers; executors read HDFS or other storage]
credits: Matei Zaharia & Xiangrui Meng
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles
[Diagram: a job as a DAG of RDDs A-F with map, groupBy, join, and filter operators, cut into Stages 1-3 at shuffle boundaries; cached partitions need not be recomputed]
credits: Matei Zaharia & Xiangrui Meng
Spark SQL
• Columnar SQL analytics engine for Spark
  – Supports both SQL and complex analytics
  – Columnar storage, JIT-compiled execution, Java/Scala/Python UDFs
  – Catalyst query optimizer (also for DataFrame scripts)
credits: Matei Zaharia & Xiangrui Meng
Hive Architecture
[Diagram: a Client (CLI, JDBC) talks to the Driver, whose pipeline is SQL Parser → Query Optimizer → Physical Plan → Execution on MapReduce over HDFS, with a Hive Catalog for metadata]
credits: Matei Zaharia & Xiangrui Meng
Spark SQL Architecture
[Diagram: same structure as Hive, but the Query Optimizer is replaced by the Catalyst query optimizer plus a Cache Manager, and execution runs on Spark instead of MapReduce]
[Engle et al, SIGMOD 2012]
credits: Matei Zaharia & Xiangrui Meng
From RDD to DataFrame
• A distributed collection of rows with the same schema (RDDs suffer from type erasure)
• Can be constructed from external data sources or from RDDs; essentially an RDD of Row objects (called SchemaRDD in Spark < 1.3)
• Supports relational operators (e.g. where, groupBy) as well as Spark operations
• Evaluated lazily → a non-materialized logical plan
credits: Matei Zaharia & Reynold Xin
DataFrame: Data Model
• Nested data model
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types
• First-class support for complex data types
credits: Matei Zaharia & Reynold Xin
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL
• Operators take expression objects
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst
• Alternatively, register the DataFrame as a temporary SQL table and run traditional SQL query strings against it
credits: Matei Zaharia & Reynold Xin
Catalyst: Plan Optimization & Execution
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans, from which a Cost Model selects one; Code Generation turns it into RDDs]
credits: Matei Zaharia & Reynold Xin
Catalyst Optimization Rules
• Applies standard rule-based optimizations (constant folding, predicate pushdown, projection pruning, null propagation, boolean expression simplification, etc.)
[Example: Logical Optimization folds x + (1 + 2) into x + 3, i.e. the subtree Add(Literal(1), Literal(2)) under Add(Attribute(x), ·) becomes Literal(3)]
credits: Matei Zaharia & Reynold Xin
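A rule like constant folding can be sketched as a recursive rewrite over a tiny expression tree — a toy model of what Catalyst does over its Scala plan trees, with our own node encoding:

```python
# Expression tree nodes: ("add", left, right) | ("lit", n) | ("attr", name)
def fold_constants(expr):
    """Recursively replace an addition of two literals with a single literal."""
    if expr[0] != "add":
        return expr
    left, right = fold_constants(expr[1]), fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", left[1] + right[1])
    return ("add", left, right)

# x + (1 + 2)  ->  x + 3
tree = ("add", ("attr", "x"), ("add", ("lit", 1), ("lit", 2)))
print(fold_constants(tree))  # ('add', ('attr', 'x'), ('lit', 3))
```

Catalyst applies many such rules repeatedly until the plan no longer changes.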
def add_demographics(events):
    u = sqlCtx.table("users")                      # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)      # Join on user_id
        .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Melbourne") \
                      .select(events.timestamp).collect()

Logical Plan: filter on top of join(events file, users table)
Physical Plan: join(scan(events), filter(scan(users)))
Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))
credits: Matei Zaharia & Reynold Xin
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
[Diagram: Original Plan — Project(name) over Filter(id = 1) over Project(id, name) over People; after Filter Push-Down, the plan is Project(name) over Project(id, name) over Filter(id = 1) over People]
credits: Matei Zaharia & Reynold Xin
Other Spark Stack Projects
• Spark SQL: we will revisit it in the SQL on Big Data lecture
• Structured Streaming: stateful, fault-tolerant stream processing

    sc.twitterStream(...)
      .flatMap(_.getText.split(" "))
      .map(word => (word, 1))
      .reduceByWindow("5s", _ + _)

  – we will revisit structured streaming in the Data Streaming lecture
This lecture, still:
• GraphX & GraphFrames: graph-processing framework
• MLlib: library of high-quality machine learning algorithms
credits: Matei Zaharia & Xiangrui Meng
Performance
[Charts, one per workload:
 – SQL: response time (s) for Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem)
 – Streaming: throughput (MB/s/node) for Storm vs Spark
 – Graph: response time (min) for Hadoop, Giraph, GraphX]
credits: Matei Zaharia & Xiangrui Meng
THINK AS A VERTEX: GRAPH PROGRAMMING
PREGEL, GIRAPH, GRAPHX
Graphs are Simple
A Computer Network
A Social Network
Maps are Graphs as well
PageRank in MapReduce
• Record: < v_i, pr, [ v_j, ..., v_k ] >
• Mapper: emits < v_j, pr / #neighbours >
• Reducer: sums the partial values
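One iteration of this scheme, simulated with plain-Python map and reduce phases (toy three-vertex graph, no damping factor; names are illustrative):

```python
from collections import defaultdict

# Adjacency + current PageRank: vertex -> (pr, neighbours)
graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}

# Map phase: each vertex emits pr / #neighbours to every neighbour
emitted = [(dst, pr / len(nbrs))
           for v, (pr, nbrs) in graph.items()
           for dst in nbrs]

# Reduce phase: sum the partial values per destination vertex
new_pr = defaultdict(float)
for dst, partial in emitted:
    new_pr[dst] += partial

print(dict(new_pr))  # {'b': 0.5, 'c': 1.5, 'a': 1.0}
```

A full run repeats this map/reduce pair N times — which is exactly the repeated-job overhead the next slide complains about.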
MapReduce DataFlow
• Each job is executed N times
• Job bootstrap cost on every run
• Mappers send PR values and graph structure
• Extensive IO at input, shuffle & sort, output
Pregel: Computational Model
• Based on Bulk Synchronous Parallel (BSP)
  – Computational units encoded in a directed graph
  – Computation proceeds in a series of supersteps
  – Message passing architecture
• Each vertex, at each superstep:
  – Receives messages directed at it from the previous superstep
  – Executes a user-defined function (modifying state)
  – Emits messages to other vertices (for the next superstep)
• Termination:
  – A vertex can choose to deactivate itself
  – It is "woken up" if new messages are received
  – Computation halts when all vertices are inactive
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Pregel
[Diagram: vertices compute in parallel within superstep t; messages cross the synchronization barrier into supersteps t+1 and t+2]
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Pregel: Implementation
• Master-slave architecture
  – Vertices are hash-partitioned (by default) and assigned to workers
  – Everything happens in memory
• Processing cycle:
  – Master tells all workers to advance a single superstep
  – Worker delivers messages from the previous superstep, executing the vertex computation
  – Messages are sent asynchronously (in batches)
  – Worker notifies master of the number of active vertices
• Fault tolerance:
  – Checkpointing
  – Heartbeat/revert
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Vertex-centric API
Shortest Paths
Shortest Paths

def compute(vertex, messages):
    minValue = Inf  # float('inf')
    for m in messages:
        minValue = min(minValue, m)
    if minValue < vertex.getValue():
        vertex.setValue(minValue)
        for edge in vertex.getEdges():
            message = minValue + edge.getValue()
            sendMessage(edge.getTargetId(), message)
    vertex.voteToHalt()
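A toy single-machine BSP loop running this vertex program to convergence. The `Vertex` class and superstep driver are plain-Python stand-ins for the Pregel runtime, not any real API; the source vertex is hard-wired as "s":

```python
import math

class Vertex:
    def __init__(self, vid, edges):           # edges: {target_id: weight}
        self.id, self.edges = vid, edges
        self.value = math.inf
        self.active = True

def superstep(vertices, inbox):
    """One BSP superstep: deliver messages, run compute, collect outgoing mail."""
    outbox = {v: [] for v in vertices}
    for vid, v in vertices.items():
        messages = inbox.get(vid, [])
        if not messages and not v.active:
            continue                           # halted and no mail: stays asleep
        v.active = True                        # new messages wake the vertex up
        min_value = min(messages, default=math.inf)
        if vid == "s":                         # the source seeds distance 0
            min_value = min(min_value, 0)
        if min_value < v.value:
            v.value = min_value
            for target, weight in v.edges.items():
                outbox[target].append(min_value + weight)
        v.active = False                       # voteToHalt()
    return {k: msgs for k, msgs in outbox.items() if msgs}

g = {"s": Vertex("s", {"a": 1, "b": 4}),
     "a": Vertex("a", {"b": 2}),
     "b": Vertex("b", {})}
inbox = {"s": []}                              # kick off the source vertex
while inbox:                                   # halt when no messages remain
    inbox = superstep(g, inbox)
print({v.id: v.value for v in g.values()})     # {'s': 0, 'a': 1, 'b': 3}
```

Note how termination falls out of the model: the loop stops as soon as a superstep produces no messages, since every vertex has voted to halt.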
Graph Programming Frameworks
• Google Pregel
  – Not open source, probably not used much anymore
• Apache Giraph
  – Developed and used by Facebook
• Apache Flink
  – Gelly API
• Apache Spark
  – GraphX API
  – DataFrames API (GraphFrames)
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
SPARK MLLIB
credits: Matei Zaharia & Xiangrui Meng
What is MLlib?
MLlib is a Spark subproject providing machine learning primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
credits: Matei Zaharia & Xiangrui Meng
What is MLlib? Algorithms:
• classification: logistic regression, linear support vector machine (SVM), naive Bayes
• regression: generalized linear regression (GLM)
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
credits: Matei Zaharia & Xiangrui Meng
Collaborative Filtering
credits: Matei Zaharia & Xiangrui Meng
Alternating Least Squares (ALS)
credits: Matei Zaharia & Xiangrui Meng
Collaborative Filtering in Spark MLLIBtrainset = sc.textFile("s3n://bads-music-dataset/train_*.gz").map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))
model = ALS.train(trainset, rank=10, iterations=10) # train
testset = # load testing setsc.textFile("s3n://bads-music-dataset/test_*.gz").map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))
# apply model to testing set (only first two cols) to predictpredictions =
model.predictAll(testset.map(lambda p: (p[0], p[1]))).map(lambda r: ((r[0], r[1]), r[2]))
credits: Matei Zaharia & Xiangrui Meng
Spark MLlib – ALS Performance

System   | Wall-clock time (seconds)
---------|--------------------------
Matlab   | 15443
Mahout   | 4206
GraphLab | 291
MLlib    | 481

• Dataset: Netflix data
• Cluster: 9 machines
• MLlib is an order of magnitude faster than Mahout
• MLlib is within a factor of 2 of GraphLab
credits: Matei Zaharia & Xiangrui Meng
Spark Implementation of ALS
• Workers load data
• Models are instantiated at workers
• At each iteration, models are shared via a join between workers
• Good scalability
• Works on large datasets
[Diagram: a Master coordinating multiple Workers]
credits: Matei Zaharia & Xiangrui Meng
What it Means for Users
• Separate frameworks: ETL, train, and query each run in their own framework, with an HDFS read and HDFS write around every step
• Spark: a single HDFS read, then ETL, train, and query share data in memory
credits: Matei Zaharia & Xiangrui Meng
Summary
• The Spark Framework
  – Generalizes Map(), Reduce() to a much larger set of operations
    • join, filter, group-by, … → closer to database queries
  – High(er) performance (than MapReduce)
    • In-memory caching, Catalyst query optimizer, JIT compilation, …
    • RDDs → DataFrames
• Spark GraphX: graph analytics (similar to Pregel/Giraph/Gelly)
  – Graph algorithms are often iterative (multi-job) → a pain in MapReduce
  – Vertex-centric programming model:
    • whom to send messages to (halt if none)
    • how to compute new vertex state from messages
• Spark MLlib: scalable machine learning
  – classification, regression, ALS, k-means, decomposition
  – Parallel DataFrame operations allow analyzing data > RAM
Conclusion
• Big data analytics is evolving to include:
  – More complex analytics (e.g. machine learning)
  – More interactive ad-hoc queries
  – More real-time stream processing
• Spark is a fast platform that unifies these apps
• More info: spark-project.org
credits: Matei Zaharia & Xiangrui Meng