Apache Spark RDDs

Dean Chen, eBay Inc.



Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM

Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk first shows how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source is referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/

Transcript of Apache Spark RDDs

Page 1: Apache Spark RDDs

Apache Spark RDDs
Dean Chen, eBay Inc.

Page 2: Apache Spark RDDs


Page 3: Apache Spark RDDs

Spark

• 2010 paper from Berkeley's AMPLab

• resilient distributed datasets (RDDs)

• Generalized distributed computation engine/platform

• Fault-tolerant in-memory caching (see the sketch below)

• Extensible interface for various workloads
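For example, a cached RDD stays in memory across actions, and a lost partition is rebuilt from its lineage rather than restored from a replica. A minimal sketch using the standard API (the input file name is hypothetical):

val logs = sc.textFile("logs.txt")                    // hypothetical input
val errors = logs.filter(_.contains("ERROR")).cache() // mark for in-memory caching
errors.count()  // first action computes and caches the partitions
errors.first()  // later actions reuse the cached copy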

Page 4: Apache Spark RDDs


Page 5: Apache Spark RDDs


Page 6: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

• Familiar Scala collections API for distributed data and computation

• Monadic expression of lazy transformations, but not monads

Page 7: Apache Spark RDDs

Spark Shell

• Interactive queries and prototyping

• Local, YARN, Mesos

• Static type checking and autocompletion

• Lambdas
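A minimal shell session, assuming a local Spark install (the shell pre-binds sc as the SparkContext):

// started with e.g.: ./bin/spark-shell --master local[4]
val nums = sc.parallelize(1 to 1000)  // distribute a local collection
nums.filter(_ % 2 == 0).count()       // res0: Long = 500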

Page 8: Apache Spark RDDs
Page 9: Apache Spark RDDs

val titles = sc.textFile("titles.txt")

val countsRdd = titles
  .flatMap(tokenize)                 // split each title into words
  .map(word => (cleanse(word), 1))   // normalize word, pair with count 1
  .reduceByKey(_ + _)                // sum the counts per word

val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
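The deck leaves tokenize and cleanse undefined; one plausible pair of definitions, assuming whitespace tokenization and lower-cased alphabetic words:

// hypothetical helpers, not shown in the talk
def tokenize(line: String): Seq[String] =
  line.split("""\s+""").toSeq                 // split on whitespace
def cleanse(word: String): String =
  word.toLowerCase.replaceAll("[^a-z]", "")   // keep letters only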

Page 10: Apache Spark RDDs


Transformations (lazy):

map, filter, flatMap, sample, union, intersection,
distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian

Page 11: Apache Spark RDDs


Actions (trigger a job):

reduce, collect, count, first,
take, takeSample, saveAsTextFile, foreach
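Transformations are lazy and only record lineage; an action is what actually schedules a job and returns a result. For example:

val lines = sc.textFile("titles.txt")  // nothing is read yet
val upper = lines.map(_.toUpperCase)   // transformation: extends the lineage graph
val n = upper.count()                  // action: runs the job, returns a Long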

Page 12: Apache Spark RDDs
Page 13: Apache Spark RDDs

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Count(word: String, total: Int)

val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect
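One step the transcript skips: for the SQL string on the next slide to resolve a table named counts, the SchemaRDD has to be registered first (registerTempTable in Spark 1.1+; earlier releases called it registerAsTable):

schemaRdd.registerTempTable("counts")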

Page 14: Apache Spark RDDs


sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

schemaRdd .filter(_.word == "scala") .map(_.total) .collect

Page 15: Apache Spark RDDs

registerFunction("LEN", (_: String).length) val queryRdd = sql(" SELECT * FROM counts WHERE LEN(word) = 10 ORDER BY total DESC LIMIT 10 ") queryRdd .map(c => s"word: ${c(0)} \t| total: ${c(1)}") .collect() .foreach(println)

Page 16: Apache Spark RDDs
Page 17: Apache Spark RDDs

Spark Streaming

• Real-time computation, similar to Storm

• Input replicated in memory for fault tolerance

• Streaming input is chopped into sliding windows of RDDs

• Kafka, Flume, Kinesis, HDFS

Page 18: Apache Spark RDDs
Page 19: Apache Spark RDDs

TwitterUtils.createStream(ssc, None)       // ssc: a StreamingContext; None: default auth
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(5))   // window length, slide interval
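Nothing flows until the streaming context is started; a minimal driver skeleton around the snippet above (the one-second batch interval is an arbitrary choice):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))  // reuse the existing SparkContext
// ... build and output DStreams as above ...
ssc.start()             // begin receiving and processing
ssc.awaitTermination()  // block until stopped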

Page 20: Apache Spark RDDs
Page 21: Apache Spark RDDs


GraphX

• Optimally partitions and indexes vertices and edges represented as RDDs

• APIs to join and traverse graphs

• PageRank, connected components, triangle counting

Page 22: Apache Spark RDDs

val graph = Graph(userIdRDD, assocRDD)   // vertex and edge RDDs built earlier

val ranks = graph.pageRank(0.0001).vertices

val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))          // (vertexId, username)
}

val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}

Page 23: Apache Spark RDDs
Page 24: Apache Spark RDDs


MLlib

• Machine learning library, similar to Mahout

• Statistics, regression, decision trees, clustering, PCA, gradient descent

• Iterative algorithms run much faster thanks to in-memory caching

Page 25: Apache Spark RDDs

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,                                  // label
    Vectors.dense(parts(1).split(' ').map(_.toDouble))  // features
  )
}

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }
  .mean()

Page 26: Apache Spark RDDs

RDDs

• Resilient distributed datasets

• Familiar Scala collections API

• Distributed data and computation

• Monadic expression of transformations

• But not monads

Page 27: Apache Spark RDDs

Pseudo Monad

• Wraps an iterator plus the distribution of partitions

• Keeps track of history for fault tolerance

• Lazily evaluated, chaining of expressions
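Because RDD defines map and flatMap, it composes in for-comprehensions even though it is not a monad in the strict sense: flatMap maps each element to a local TraversableOnce, not to another RDD. For example:

// desugars to sc.textFile(...).flatMap(line => line.split(" ").map(w => (w, 1)))
val pairs = for {
  line <- sc.textFile("titles.txt")
  word <- line.split(" ")
} yield (word, 1)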

Page 28: Apache Spark RDDs


Page 29: Apache Spark RDDs

RDD Interface

• compute: transformation applied to iterable(s)

• getPartitions: partition data for parallel computation

• getDependencies: lineage of parent RDDs and if shuffle is required
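In the Spark source these correspond to the core of the RDD contract; a simplified sketch (the real class also carries a partitioner, preferred locations, caching logic, etc.):

import org.apache.spark.{Dependency, Partition, TaskContext}

// simplified from org.apache.spark.rdd.RDD
abstract class SimplifiedRDD[T] {
  // transformation applied to a partition's iterator(s)
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // how the data is split up for parallel computation
  protected def getPartitions: Array[Partition]
  // lineage: parent RDDs and whether a shuffle is required
  protected def getDependencies: Seq[Dependency[_]]
}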

Page 30: Apache Spark RDDs


HadoopRDD

• compute: read HDFS block or file split

• getPartitions: HDFS block or file split

• getDependencies: None

Page 31: Apache Spark RDDs


MappedRDD

• compute: compute parent and map the result

• getPartitions: parent partition

• getDependencies: single dependency on parent
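Fleshing that out, a one-to-one RDD needs very little code; a simplified version of what the Spark source calls MappedRDD (the real class reaches the parent through a protected firstParent helper and cleans the closure):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class SimpleMappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {  // RDD(prev) declares a single one-to-one dependency

  // same partitioning as the parent
  override def getPartitions: Array[Partition] = prev.partitions

  // compute the parent's iterator for this split, then map it
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context).map(f)
}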

Page 32: Apache Spark RDDs


• compute: compute the parent RDDs, shuffle, then group the results

• getPartitions: one per reduce task

• getDependencies: shuffle each parent RDD

Page 33: Apache Spark RDDs


• Simple Unified API through RDDs

• Interactive Analysis

• Hadoop Integration

• Performance

Page 34: Apache Spark RDDs

References

• http://www.cs.berkeley.edu/~matei/papers/2010/


• https://www.youtube.com/watch?v=HG2Yd-3r4-M

• https://www.youtube.com/watch?v=e-Ys-2uVxM0

• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream