Zaharia spark-scala-days-2012

download Zaharia spark-scala-days-2012

of 31

  • date post

  • Category


  • view

  • download


Embed Size (px)



Transcript of Zaharia spark-scala-days-2012

  • 1. Spark in ActionFast Big Data Analytics using ScalaMatei ZahariaUniversity of California, UC BERKELEY

2. My BackgroundGrad student in the AMP Lab at UC Berkeley 50-person lab focusing on big dataCommitter on Apache HadoopStarted Spark in 2009 to provide a richer,Hadoop-compatible computing engine 3. Spark GoalsExtend the MapReduce model to support moretypes of applications efficiently Spark can run 40x faster than Hadoop for iterative and interactive applicationsMake jobs easier to program Language-integrated API in Scala Interactive use from Scala interpreter 4. Why go Beyond MapReduce?MapReduce simplified big data analysis by givinga reliable programming model for large clustersBut as soon as it got popular, users wanted more: More complex, multi-stage applications More interactive ad-hoc queries 5. Why go Beyond MapReduce?Complex jobs and interactive queries both needone thing that MapReduce lacks: Efficient primitives for data sharing Query 1Stage 3 Stage 2 Stage 1 Query 2 Query 3Iterative algorithm Interactive data mining 6. Why go Beyond MapReduce?Complex jobs and interactive queries both needone thing that MapReduce lacks: Efficient primitives for data sharing Query 1Stage 3 Stage 2 Stage 1In MapReduce, the only way to shareQuery 2 acrossdatajobs is stable storage (e.g. HDFS) Query 3 -> slow!Iterative algorithm Interactive data mining 7. ExamplesHDFS HDFS HDFS HDFSread writeread write iter. 1 iter. 2 . . .Input HDFSquery 1result 1 read query 2result 2 query 3result 3Input. . . I/O and serialization can take 90% of the time 8. Goal: In-Memory Data Sharingiter. 1 iter. 2 . . .Inputquery 1 one-timeprocessingquery 2query 3Input Distributed memory. . . 10-100 faster than network and disk 9. Solution: ResilientDistributed Datasets (RDDs)Distributed collections of objects that can bestored in memory for fast reuseAutomatically recover lost data on failureSupport a wide range of applications 10. OutlineSpark programming modelUser applicationsImplementationDemoWhats next 11. Programming ModelResilient distributed datasets (RDDs) Immutable, partitioned collections of objects Can be cached in memory for efficient reuseTransformations (e.g. map, filter, groupBy, join) Build RDDs from other RDDsActions (e.g. count, collect, save) Return a result or write it to storage 12. Example: Log Mining Load error messages from a log into memory, then interactively search for various patternsBaseTransformed RDD RDD Cache 1lines = spark.textFile(hdfs://...) Worker resultserrors = lines.filter(_.startsWith(ERROR))messages = 1DrivercachedMsgs = messages.cache()ActioncachedMsgs.filter(_.contains(foo)).countcachedMsgs.filter(_.contains(bar)).countCache 2 Worker. . .Cache 3 WorkerBlock 2 Result: scaled tosearch of Wikipedia full-text 1 TB data in 5-7 sec in (word, 1)) .reduceByWindow(5, _ + _) T=2 28. ConclusionSpark offers a simple, efficient and powerfulprogramming model for a wide range of appsShark and Spark Streaming coming soonDownload and docs: @matei_zaharia / 29. Related WorkDryadLINQ Build queries through language-integrated SQL operations on lazy datasets Cannot have a dataset persist across queriesRelational databases Lineage/provenance, logical logging, materialized viewsPiccolo Parallel programs with shared distributed hash tables; similar to distributed shared memoryIterative MapReduce (Twister and HaLoop) Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively 30. Related WorkDistributed shared memory (DSM) Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing)RAMCloud In-memory storage system for web applications Allows random reads/writes and uses logging like DSMNectar Caching system for DryadLINQ programs that can reuse intermediate results across jobs Does not provide caching in memory, explicit support over which data is cached, or control over partitioningSMR (functional Scala API for Hadoop) 31. Behavior with Not Enough RAM 10068.8Iteration time (s)58.1 80 40.7 6029.7 40 11.5 200Cache 25% 50%75%Fully disabledcached% of working set in cache