Apache Spark RDDs

Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how this abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib, and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/
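The talk's point about Scala's functional idioms can be previewed with a minimal sketch: the word-count pipeline from the slides, written against plain Scala collections instead of RDDs. The `tokenize` and `cleanse` helpers here are hypothetical stand-ins for functions the slides assume; the RDD API mirrors this collections API, swapping a local `Seq` for a distributed dataset.

```scala
// Word count over plain Scala collections, mirroring the RDD pipeline
// from the talk. tokenize and cleanse are assumed helpers, sketched here.
object WordCountSketch {
  def tokenize(line: String): Seq[String] = line.split("\\s+").toSeq
  def cleanse(word: String): String = word.toLowerCase.filter(_.isLetter)

  def counts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .map(w => (cleanse(w), 1))
      .groupBy(_._1)                               // local analogue of reduceByKey
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) }
}

// WordCountSketch.counts(Seq("Spark spark rdd")) → Map("spark" -> 2, "rdd" -> 1)
```

On an RDD the same shape holds, except `groupBy` + local sum is replaced by `reduceByKey(_ + _)`, which combines values per key before shuffling.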

Transcript of Apache Spark RDDs

  • 1. Apache Spark RDDs. Dean Chen, eBay Inc.

2. http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

3. Spark
   2010 paper out of Berkeley's AMPLab
   Resilient distributed datasets (RDDs)
   Generalized distributed computation engine/platform
   Fault-tolerant in-memory caching
   Extensible interface for various workloads

4. http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

5. https://amplab.cs.berkeley.edu/software/

6. RDDs
   Resilient distributed datasets: "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"
   Familiar Scala collections API for distributed data and computation
   Monadic expression of lazy transformations, but not monads

7. Spark Shell
   Interactive queries and prototyping
   Local, YARN, Mesos
   Static type checking and auto-complete
   Lambdas

8. val titles = sc.textFile("titles.txt")
   val countsRdd = titles
     .flatMap(tokenize)
     .map(word => (cleanse(word), 1))
     .reduceByKey(_ + _)
   val counts = countsRdd
     .filter { case (_, total) => total > 10000 }
     .sortBy { case (_, total) => total }
     .filter { case (word, _) => word.length >= 5 }
     .collect

9. Transformations: map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian

10. Actions: reduce, collect, count, first, take, takeSample, saveAsTextFile, foreach

11. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    case class Count(word: String, total: Int)
    val schemaRdd = countsRdd.map(c => Count(c._1, c._2))
    val count = schemaRdd
      .where('word === "scala")
      .select('total)
      .collect

12. schemaRdd.registerTempTable("counts")
    sql("SELECT total FROM counts WHERE word = 'scala'").collect
    schemaRdd.filter(_.word == "scala").map(_.total).collect

13. registerFunction("LEN", (_: String).length)
    val queryRdd = sql(
      "SELECT * FROM counts WHERE LEN(word) = 10 ORDER BY total DESC LIMIT 10")
    queryRdd
      .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
      .collect()
      .foreach(println)

14.
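The split between transformations (slide 9) and actions (slide 10) comes down to laziness. As a rough sketch under the assumption that a plain Scala `Iterator` can stand in for an RDD's per-partition iterator, nothing in the chain runs until an action forces evaluation:

```scala
// Sketch of the transformation/action split: map and filter on a Scala
// Iterator are lazy, so the pipeline below does no work until toList
// (the stand-in for an action like collect) is called.
object LazyPipeline {
  var evaluations = 0

  def run(): (Int, List[Int]) = {
    val transformed = Iterator(1, 2, 3, 4, 5)
      .map { n => evaluations += 1; n * 10 } // "transformation": not yet run
      .filter(_ > 20)                        // still lazy, still no work done
    val observedBeforeAction = evaluations   // no elements computed so far
    (observedBeforeAction, transformed.toList) // toList plays the "action"
  }
}

// LazyPipeline.run() → (0, List(30, 40, 50))
```

In Spark the same deferral is what lets the scheduler see the whole lineage before running anything, pipelining narrow transformations into single stages.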
Spark Streaming
   Realtime computation similar to Storm
   Input distributed to memory for fault tolerance
   Streaming input into sliding windows of RDDs
   Kafka, Flume, Kinesis, HDFS

15. TwitterUtils.createStream()
      .filter(_.getText.contains("Spark"))
      .countByWindow(Seconds(5))

16. GraphX
   Optimally partitions and indexes vertices and edges represented as RDDs
   APIs to join and traverse graphs
   PageRank, connected components, triangle counting

17. val graph = Graph(userIdRDD, assocRDD)
    val ranks = graph.pageRank(0.0001).vertices
    val userRdd = sc.textFile("graphx/data/users.txt")
    val users = userRdd.map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }
    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }

18. MLlib
   Machine learning library similar to Mahout
   Statistics, regression, decision trees, clustering, PCA, gradient descent
   Iterative algorithms much faster due to in-memory caching

19. val data = sc.textFile("data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    val model = LinearRegressionWithSGD.train(parsedData, 100)
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

20. RDDs
   Resilient distributed datasets
   Familiar Scala collections API
   Distributed data and computation
   Monadic expression of transformations, but not monads

21. Pseudo Monad
   Wraps iterator + partitions distribution
   Keeps track of history for fault tolerance
   Lazily evaluated, chaining of expressions

22. https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

23. RDD Interface
   compute: transformation applied to iterable(s)
   getPartitions: partition data for parallel computation
   getDependencies: lineage of parent RDDs and whether a shuffle is required

24.
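The three-method RDD interface on slide 23 can be sketched in plain Scala. This is a toy under stated assumptions: `Partition` and `Dependency` here are simplified stand-ins for Spark's real types, and scheduling, storage, and serialization are all omitted.

```scala
// Simplified stand-ins for Spark's Partition and Dependency types.
case class Partition(index: Int)
sealed trait Dependency
case object ShuffleDependency extends Dependency
case class NarrowDependency(parent: MiniRDD[_]) extends Dependency

// The three-method interface from slide 23, minus scheduling and storage.
trait MiniRDD[T] {
  def compute(p: Partition): Iterator[T]  // transformation over one partition
  def getPartitions: Array[Partition]     // how the data is split up
  def getDependencies: Seq[Dependency]    // lineage back to parent RDDs

  // An "action": materialize every partition's iterator.
  def collect(): Seq[T] = getPartitions.toSeq.flatMap(p => compute(p).toSeq)
}

// A source RDD with no parents, backed by an in-memory Seq and split
// round-robin across numPartitions partitions.
class SeqRDD[T](data: Seq[T], numPartitions: Int) extends MiniRDD[T] {
  def getPartitions = Array.tabulate(numPartitions)(i => Partition(i))
  def getDependencies = Nil
  def compute(p: Partition) =
    data.zipWithIndex.collect {
      case (x, i) if i % numPartitions == p.index => x
    }.iterator
}
```

For example, `new SeqRDD(Seq(1, 2, 3, 4), 2).collect()` walks partition 0 (elements 1, 3) then partition 1 (elements 2, 4), yielding `Seq(1, 3, 2, 4)`.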
HadoopRDD
   compute: read HDFS block or file split
   getPartitions: HDFS block or file split
   getDependencies: none

25. MappedRDD
   compute: compute parent and map result
   getPartitions: parent partition
   getDependencies: single dependency on parent

26. CoGroupedRDD
   compute: compute, shuffle, then group parent RDDs
   getPartitions: one per reduce task
   getDependencies: shuffle each parent RDD

27. Summary
   Simple unified API through RDDs
   Interactive analysis
   Hadoop integration
   Performance

28. References
   http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
   https://www.youtube.com/watch?v=HG2Yd-3r4-M
   https://www.youtube.com/watch?v=e-Ys-2uVxM0
   RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream

29. deanchen5@gmail.com
   @deanchen
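The MappedRDD pattern on slides 25-26, where `compute` delegates to the parent for the same partition, can be sketched as a self-contained toy. These are not Spark's actual classes; the names and the reduced interface here are assumptions of the sketch. The key point survives: because a mapped node holds only a function and a reference to its parent, any lost partition can be rebuilt by recomputing along that lineage.

```scala
// Toy lineage sketch for the MappedRDD slides: a mapped node recomputes by
// asking its parent for the same partition, which is also how a lost
// partition would be rebuilt from lineage after a failure.
case class Split(index: Int)

trait ToyRDD[T] {
  def compute(s: Split): Iterator[T]
  def getPartitions: Array[Split]
  def parents: Seq[ToyRDD[_]]            // lineage, as in getDependencies
  def collect(): Seq[T] = getPartitions.toSeq.flatMap(s => compute(s).toSeq)
}

// Source node: like HadoopRDD, reads its own storage directly (here, a
// Seq of pre-made partitions stands in for HDFS blocks) and has no parents.
class SourceRDD[T](blocks: Seq[Seq[T]]) extends ToyRDD[T] {
  def getPartitions = Array.tabulate(blocks.length)(i => Split(i))
  def parents = Nil
  def compute(s: Split) = blocks(s.index).iterator
}

// MappedRDD-style node: same partitions as its parent, a single narrow
// dependency, and compute = parent's compute followed by map.
class MapRDD[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
  def getPartitions = parent.getPartitions
  def parents = Seq(parent)
  def compute(s: Split) = parent.compute(s).map(f)
}

// new MapRDD(new SourceRDD(Seq(Seq(1, 2), Seq(3))), (n: Int) => n * 2)
//   .collect() → Seq(2, 4, 6)
```

A CoGroupedRDD-style node would differ in exactly the way slide 26 describes: its partitioning is its own (one per reduce task) rather than inherited, and its dependencies are shuffles rather than narrow parent links.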