Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20....
Transcript of Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20....
![Page 1: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/1.jpg)
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
Spark Fast, Interactive, Language-‐Integrated Cluster Computing
UC Berkeley
![Page 2: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/2.jpg)
Background MapReduce and its variants greatly simplified big data analytics by hiding scaling and faults
However, these systems provide a restricted programming model
Can we design similarly powerful abstractions for a broader class of applications?
![Page 3: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/3.jpg)
Motivation Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
Map
Map
Map
Reduce
Reduce
Input Output
![Page 4: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/4.jpg)
Motivation
Map
Map
Map
Reduce
Reduce
Input Output
Benefits of data flow: runtime can decide where to run tasks and can automatically
recover from failures
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
![Page 5: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/5.jpg)
Motivation Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: » Iterative algorithms (machine learning, graphs) » Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query
![Page 6: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/6.jpg)
Spark Goal Efficiently support apps with working sets » Let them keep data in memory
Retain the attractive properties of MapReduce: » Fault tolerance (for crashes & stragglers) » Data locality » Scalability
Solution: extend data flow model with “resilient distributed datasets” (RDDs)
![Page 7: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/7.jpg)
Outline Spark programming model
Applications
Implementation
Demo
![Page 8: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/8.jpg)
Programming Model Resilient distributed datasets (RDDs) » Immutable, partitioned collections of objects » Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage » Can be cached for efficient reuse
Actions on RDDs » Count, reduce, collect, save, …
![Page 9: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/9.jpg)
Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count
. . .
tasks
results
Cache 1
Cache 2
Cache 3
Base RDD Transformed RDD
Action
Result: full-‐text search of Wikipedia in <1 sec (vs 20 sec for on-‐disk data) Result: scaled to 1 TB data in 5-‐7 sec
(vs 170 sec for on-‐disk data)
![Page 10: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/10.jpg)
RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:
cachedMsgs = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2)) .cache()
HdfsRDD path: hdfs://…
FilteredRDD func: contains(...)
MappedRDD func: split(…) CachedRDD
![Page 11: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/11.jpg)
Example: Logistic Regression
Goal: find best line separating two sets of points
+
–
+ + +
+
+
+ + +
– – –
–
–
– – –
+
target
–
random initial line
![Page 12: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/12.jpg)
Example: Logistic Regression val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)
![Page 13: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/13.jpg)
Logistic Regression Performance
0 500
1000 1500 2000 2500 3000 3500 4000 4500
1 5 10 20 30
Run
ning
Tim
e (s)
Number of Iterations
Hadoop
Spark
127 s / iteration
first iteration 174 s further iterations 6 s
![Page 14: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/14.jpg)
Spark Applications Twitter spam classification (Monarch)
In-‐memory analytics on Hive data (Conviva)
Traffic prediction using EM (Mobile Millennium)
K-‐means clustering
Alternating Least Squares matrix factorization
Network simulation
![Page 15: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/15.jpg)
Conviva GeoReport
Aggregations on many group keys w/ same WHERE clause
Gains come from: » Not re-‐reading unused columns » Not re-‐reading filtered records » Not repeated decompression » In-‐memory storage of deserialized objects
0.5
20
0 5 10 15 20
Spark
Hive
Time (hours)
![Page 16: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/16.jpg)
Generality of RDDs RDDs can efficiently express many proposed cluster programming models » MapReduce => map and reduceByKey operations » Dryad => Spark runs general DAGs of tasks » Pregel iterative graph processing => Bagel » SQL => Hive on Spark (Hark?)
Can also express apps that neither of these can
![Page 17: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/17.jpg)
Implementation Spark runs on the Mesos cluster manager, letting it share resources with Hadoop & other apps
Can read from any Hadoop input source (e.g. HDFS)
Spark Hadoop MPI
Mesos
Node Node Node Node
…
~7000 lines of code; no changes to Scala compiler
![Page 18: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/18.jpg)
Language Integration Scala closures are Serializable Java objects » Serialize on master, load & run on workers
Not quite enough » Nested closures may reference entire outer scope » May pull in non-‐Serializable variables not used inside » Solution: bytecode analysis + reflection
Other tricks using custom serialized forms (e.g. “accumulators” as syntactic sugar for counters)
![Page 19: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/19.jpg)
Interactive Spark Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes: » Modified wrapper code generation so that each line typed has references to objects for its dependencies » Distribute generated classes over the network
Enables in-‐memory exploration of big data
![Page 20: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/20.jpg)
Demo
![Page 21: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/21.jpg)
Conclusion Spark’s resilient distributed datasets are a simple programming model for a wide range of apps
Download our open source release at
www.spark-‐project.org
![Page 22: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/22.jpg)
Related Work DryadLINQ » Build queries through language-‐integrated SQL operations on lazy datasets » Cannot have a dataset persist across queries
Relational databases » Lineage/provenance, logical logging, materialized views
Piccolo » Parallel programs with shared distributed hash tables; similar to distributed shared memory
Iterative MapReduce (Twister and HaLoop) » Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
![Page 23: Fast,&Interactive,&LanguageBIntegrated ... - Scaladays2011.scala-lang.org/sites/days2011/files/20. Spark.pdf · Matei&Zaharia,&Mosharaf&Chowdhury,Tathagata Das,& Ankur&Dave,&Justin&Ma,&Murphy&McCauley,&Michael&Franklin,&](https://reader033.fdocuments.net/reader033/viewer/2022050305/5f6e453090d1087d837ef1ab/html5/thumbnails/23.jpg)
Related Work Distributed shared memory (DSM) » Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing)
RAMCloud » In-‐memory storage system for web applications » Allows random reads/writes and uses logging like DSM
Nectar » Caching system for DryadLINQ programs that can reuse intermediate results across jobs » Does not provide caching in memory, explicit support over which data is cached, or control over partitioning
SMR (functional Scala API for Hadoop)