Apache Spark RDDs

Transcript
Page 1: Apache Spark RDDs

Apache Spark RDDs
Dean Chen, eBay Inc.

Page 2: Apache Spark RDDs

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Page 3: Apache Spark RDDs

Spark

• 2010 paper from Berkeley's AMPLab

• resilient distributed datasets (RDDs)

• Generalized distributed computation engine/platform

• Fault-tolerant in-memory caching

• Extensible interface for various workloads

Page 4: Apache Spark RDDs

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Page 5: Apache Spark RDDs

https://amplab.cs.berkeley.edu/software/

Page 6: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

• Familiar Scala collections API for distributed data and computation

• Monadic-style expression of lazy transformations, though RDDs are not actual monads (see the sketch below)
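A minimal sketch of those last two points, assuming a spark-shell session where sc is already available: transformations only describe a computation in collections-style code, and nothing runs until an action is called.

val nums = sc.parallelize(1 to 1000000)        // distributed collection; nothing computed yet
val evens = nums.filter(_ % 2 == 0)            // transformation: recorded lazily, not executed
val squares = evens.map(n => n.toLong * n)     // still lazy; reads like ordinary Scala collections code

squares.cache()                                // mark for in-memory caching on first computation
val total = squares.reduce(_ + _)              // action: triggers the whole chain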

Page 7: Apache Spark RDDs

Spark Shell

• Interactive queries and prototyping

• Local, YARN, Mesos

• Static type checking and autocomplete

• Lambdas
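A hedged sketch of what such a session looks like; the launch command and file name are only examples, and the shell provides a preconfigured SparkContext bound to sc.

// Launch locally with 4 worker threads; a YARN or Mesos master URL works the same way:
//   ./bin/spark-shell --master local[4]

// Inside the shell, sc is already available, with tab completion and
// compile-time type errors as in any Scala REPL:
val titles = sc.textFile("titles.txt")              // file name is just an example
titles.filter(_.contains("Spark")).count()          // returns a Long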

Page 8: Apache Spark RDDs
Page 9: Apache Spark RDDs

val titles = sc.textFile("titles.txt")

// tokenize and cleanse are helper functions defined on an earlier slide
// (not shown in this transcript): split a title into words, normalize a word
val countsRdd = titles
  .flatMap(tokenize)
  .map(word => (cleanse(word), 1))
  .reduceByKey(_ + _)

val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect

Page 10: Apache Spark RDDs

Transformations

map filter flatMap sample union intersection

distinct groupByKey reduceByKey sortByKey join cogroup cartesian

Page 11: Apache Spark RDDs

Actions

reduce collect count first

take takeSample saveAsTextFile foreach
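A short sketch, assuming sc from the spark-shell, showing the split in practice: transformations return new RDDs and stay lazy, while actions return local values (or write output) and actually trigger a job. The output path is only an example.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Transformations return new RDDs and are lazy
val summed = pairs.reduceByKey(_ + _)        // RDD[(String, Int)]
val sorted = summed.sortByKey()              // RDD[(String, Int)], still nothing executed

// Actions return local values (or write data out) and trigger execution
sorted.collect()                             // Array((a,4), (b,2))
sorted.count()                               // 2
sorted.first()                               // (a,4)
sorted.saveAsTextFile("counts_out")          // writes one text file per partition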

Page 12: Apache Spark RDDs
Page 13: Apache Spark RDDs

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Count(word: String, total: Int)

val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect

Page 14: Apache Spark RDDs

schemaRdd.registerTempTable("counts")

sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

schemaRdd
  .filter(_.word == "scala")
  .map(_.total)
  .collect

Page 15: Apache Spark RDDs

registerFunction("LEN", (_: String).length)

val queryRdd = sql(" SELECT * FROM counts WHERE LEN(word) = 10 ORDER BY total DESC LIMIT 10 ")

queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)

Page 16: Apache Spark RDDs
Page 17: Apache Spark RDDs

Spark Streaming

• Real-time computation, similar to Storm

• Input data replicated in memory for fault tolerance

• Streaming input into sliding windows of RDDs

• Kafka, Flume, Kinesis, HDFS

Page 18: Apache Spark RDDs
Page 19: Apache Spark RDDs

TwitterUtils.createStream(ssc, None)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(5))
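A minimal sketch of the surrounding setup this snippet assumes: a StreamingContext built from the existing SparkContext, Twitter OAuth credentials supplied as twitter4j system properties, a checkpoint directory for the windowed count, and an explicit start. The paths and intervals are illustrative.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(5))   // batch interval; windows must be multiples of it
ssc.checkpoint("/tmp/spark-checkpoint")          // windowed counts keep state across batches

val counts = TwitterUtils.createStream(ssc, None)     // None: OAuth read from twitter4j.oauth.* properties
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(5))               // tweets in the last 5s, recomputed every 5s

counts.print()                                   // emit the windowed count each batch

ssc.start()                                      // nothing is received until the context starts
ssc.awaitTermination()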

Page 20: Apache Spark RDDs
Page 21: Apache Spark RDDs

GraphX

• Optimally partitions and indexes vertices and edges represented as RDDs

• APIs to join and traverse graphs

• PageRank, connected components, triangle counting

Page 22: Apache Spark RDDs

val graph = Graph(userIdRDD, assocRDD)

val ranks = graph.pageRank(0.0001).vertices

val userRDD = sc.textFile("graphx/data/users.txt")

val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
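The other algorithms named on the previous slide are exposed the same way; a brief sketch, assuming the same graph value as above:

import org.apache.spark.graphx.PartitionStrategy

// Connected components: label each vertex with the lowest vertex id in its component
val components = graph.connectedComponents().vertices

// Triangle counting needs edges in canonical orientation (srcId < dstId)
// and a partitioned graph, hence the partitionBy call
val triangles = graph
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .triangleCount()
  .vertices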

Page 23: Apache Spark RDDs
Page 24: Apache Spark RDDs

MLlib

• Machine learning library similar to Mahout

• Statistics, regression, decision trees, clustering, PCA, gradient descent

• Iterative algorithms are much faster due to in-memory caching

Page 25: Apache Spark RDDs

val data = sc.textFile("data.txt")

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble))
  )
}

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds
  .map { case (v, p) => math.pow((v - p), 2) }
  .mean()

Page 26: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• Familiar Scala collections API

• Distributed data and computation

• Monadic expression of transformations

• But not monads

Page 27: Apache Spark RDDs

Pseudo Monad

• Wraps iterator + partitions distribution

• Keeps track of history for fault tolerance

• Lazily evaluated chaining of expressions (see the sketch below)
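A small sketch of those points, reusing the titles file from the earlier word-count example and assuming sc from the spark-shell: the chained transformations build up a lineage that can be inspected, and only an action forces evaluation.

val counts = sc.textFile("titles.txt")       // nothing is read yet
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                        // still no job has run

// The RDD remembers how it was derived from its parents; this lineage is what
// lets Spark recompute a lost partition instead of replicating the data.
println(counts.toDebugString)

// Only an action forces evaluation of the whole chain.
counts.take(5).foreach(println)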

Page 28: Apache Spark RDDs

https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

Page 29: Apache Spark RDDs

RDD Interface

• compute: transformation applied to the parent iterator(s)

• getPartitions: partition data for parallel computation

• getDependencies: lineage of parent RDDs and whether a shuffle is required

Page 30: Apache Spark RDDs

HadoopRDD

• compute: read HDFS block or file split

• getPartitions: HDFS block or file split

• getDependencies: None

Page 31: Apache Spark RDDs

MappedRDD

• compute: compute parent and map result

• getPartitions: parent partition

• getDependencies: single dependency on parent
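A hedged sketch of what such an RDD subclass looks like in user code, filling in the interface methods from the previous slides; the class name is made up, and this mirrors rather than reproduces Spark's internal MappedRDD.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Passing the parent to RDD's constructor registers a narrow OneToOneDependency,
// so the default getDependencies already reports the single parent.
class MyMappedRDD[T, U: ClassTag](parent: RDD[T], f: T => U)
    extends RDD[U](parent) {

  // compute: evaluate the parent's iterator for this partition and map the result
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    parent.iterator(split, context).map(f)

  // getPartitions: reuse the parent's partitioning unchanged
  override protected def getPartitions: Array[Partition] = parent.partitions
}

// e.g. new MyMappedRDD(sc.parallelize(1 to 10), (x: Int) => x * 2).collect()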

Page 32: Apache Spark RDDs

CoGroupedRDD

• compute: compute the parent RDDs, shuffle, then group

• getPartitions: one per reduce task

• getDependencies: shuffle each parent RDD

Page 33: Apache Spark RDDs

Summary

• Simple Unified API through RDDs

• Interactive Analysis

• Hadoop Integration

• Performance

Page 34: Apache Spark RDDs

References

• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

• https://www.youtube.com/watch?v=HG2Yd-3r4-M

• https://www.youtube.com/watch?v=e-Ys-2uVxM0

• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream