Apache Spark RDDs

  • Apache Spark RDDs

    Dean Chen, eBay Inc.

  • http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf


  • Spark: 2010 paper from Berkeley's AMPLab

    Resilient distributed datasets (RDDs)

    Generalized distributed computation engine/platform

    Fault-tolerant in-memory caching

    Extensible interface for various workloads

  • https://amplab.cs.berkeley.edu/software/


  • RDDs: Resilient distributed datasets

    "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

    Familiar Scala collections API for distributed data and computation

    Monadic expression of lazy transformations, but not monads

  • Spark Shell

    Interactive queries and prototyping

    Local, YARN, Mesos

    Static type checking and autocomplete


  • val titles = sc.textFile("titles.txt")

    val countsRdd = titles
      .flatMap(tokenize)
      .map(word => (cleanse(word), 1))
      .reduceByKey(_ + _)

    val counts = countsRdd
      .filter { case (_, total) => total > 10000 }
      .sortBy { case (_, total) => total }
      .filter { case (word, _) => word.length >= 5 }
      .collect
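
    The slide does not define tokenize or cleanse. The sketch below supplies hypothetical versions of both and runs the same pipeline shape on a local Seq instead of an RDD, so it works without a cluster. The names WordCountSketch, countWords, and frequent are assumptions made for illustration, and groupBy plus sum stands in for reduceByKey(_ + _).

```scala
object WordCountSketch {
  // Hypothetical helper: split a title into whitespace-separated tokens.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").toSeq.filter(_.nonEmpty)

  // Hypothetical helper: lowercase a token and keep only its letters.
  def cleanse(word: String): String =
    word.toLowerCase.filter(_.isLetter)

  // Same shape as the slide's pipeline, on a local Seq instead of an RDD.
  // groupBy followed by a sum plays the role of reduceByKey(_ + _).
  def countWords(titles: Seq[String]): Map[String, Int] =
    titles
      .flatMap(tokenize)
      .map(word => (cleanse(word), 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Keep words with more than minTotal occurrences and at least minLength
  // letters, sorted by ascending total, mirroring the slide's filters.
  def frequent(counts: Map[String, Int], minTotal: Int, minLength: Int): Seq[(String, Int)] =
    counts.toSeq
      .filter { case (_, total) => total > minTotal }
      .sortBy { case (_, total) => total }
      .filter { case (word, _) => word.length >= minLength }
}
```

    For example, WordCountSketch.countWords(Seq("Spark in Action", "Learning Spark!")) maps "spark" to 2, because cleanse folds "Spark" and "Spark!" into the same key.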

  • Transformations

    map filter flatMap sample union intersection

    distinct groupByKey reduceByKey sortByKey join cogroup cartesian

  • Actions

    reduce collect count first

    take takeSample saveAsTextFile foreach
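
    The split between transformations and actions is about when work happens: transformations only describe a computation, and an action forces it. The sketch below imitates that behavior with a plain Scala Iterator, which is lazy in the same spirit. It is a toy stand-in, not Spark code, and the class name TracedPipeline and its members are assumptions made for illustration.

```scala
// Toy illustration: Scala's Iterator stands in for an RDD.
final class TracedPipeline(data: Seq[Int]) {
  // Number of elements the map step has processed so far.
  var evaluated: Int = 0

  // "Transformations": describe the work lazily; nothing is computed here.
  def transformed: Iterator[Int] =
    data.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 4 == 0)

  // "Action": pulls every element through the pipeline and returns a value.
  def total(): Int = transformed.sum
}
```

    Calling transformed leaves evaluated at zero. Only total(), the counterpart of an action such as reduce or collect, drives elements through the map and filter steps.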

  • val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    case class Count(word: String, total: Int)

    val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

    val count = schemaRdd
      .where('word === "scala")
      .select('total)
      .collect

  • schemaRdd.registerTempTable("counts")

    sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

    schemaRdd
      .filter(_.word == "scala")
      .map(_.total)
      .collect

  • registerFunction("LEN", (_: String).length)

    val queryRdd = sql("SELECT * FROM counts WHERE LEN(word) = 10 ORDER BY total DESC LIMIT 10")

    queryRdd
      .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
      .collect()
      .foreach(println)

  • Spark Streaming

    Real-time computation similar to Storm

    Input distributed in memory for fault tolerance

    Streaming input into sliding windows of RDDs

    Kafka, Flume, Kinesis, HDFS

  • // ssc is assumed to be an existing StreamingContext
    TwitterUtils.createStream(ssc, None)
      .filter(_.getText.contains("Spark"))
      .countByWindow(Seconds(5), Seconds(5))
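
    To make the sliding-window idea concrete without a streaming cluster, the sketch below treats each inner Seq as one micro-batch of tweet texts and counts keyword matches over windows of consecutive batches. It is a local approximation written for illustration, not the DStream API, and the names WindowSketch and countByWindow are assumptions.

```scala
object WindowSketch {
  // Each inner Seq is one micro-batch of tweet texts.
  // Returns, for every window of `size` consecutive batches (sliding by one
  // batch), the number of tweets that contain `keyword`.
  def countByWindow(batches: Seq[Seq[String]], size: Int, keyword: String): List[Int] =
    batches
      .map(batch => batch.count(_.contains(keyword))) // matches per batch
      .sliding(size)                                  // windows of batches
      .map(_.sum)                                     // matches per window
      .toList
}
```

    With three batches and a window of two, the result has two entries, one per window position.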

  • GraphX

    Optimally partitions and indexes vertices and edges represented as RDDs

    APIs to join and traverse graphs

    PageRank, connected components, triangle counting

  • val graph = Graph(userIdRDD, assocRDD)

    val ranks = graph.pageRank(0.0001).vertices

    val userRDD = sc.textFile("graphx/data/users.txt")

    val users = userRDD.map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }

    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }
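
    For intuition about what the ranks hold, here is a small PageRank sketch on local collections, followed by the same join against usernames. It is not GraphX's implementation: it runs a fixed number of iterations instead of stopping at a tolerance, uses the common update 0.15 + 0.85 * (sum of contributions), and every name in it (PageRankSketch, pageRank, ranksByUsername) is an assumption made for illustration.

```scala
object PageRankSketch {
  // edges are (sourceId, destinationId) pairs; every rank starts at 1.0.
  def pageRank(edges: Seq[(Long, Long)], iterations: Int, damping: Double = 0.85): Map[Long, Double] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (src, out) => (src, out.size) }
    var ranks = vertices.map(v => (v, 1.0)).toMap

    for (_ <- 1 to iterations) {
      // Each vertex sends rank / outDegree along every outgoing edge.
      val contribs = edges
        .map { case (src, dst) => (dst, ranks(src) / outDegree(src)) }
        .groupBy(_._1)
        .map { case (dst, sent) => (dst, sent.map(_._2).sum) }
      ranks = vertices
        .map(v => (v, (1 - damping) + damping * contribs.getOrElse(v, 0.0)))
        .toMap
    }
    ranks
  }

  // Local counterpart of users.join(ranks) followed by the map on the slide.
  def ranksByUsername(users: Map[Long, String], ranks: Map[Long, Double]): Map[String, Double] =
    for ((id, username) <- users; rank <- ranks.get(id)) yield (username, rank)
}
```

    Vertices with no incoming edges settle at the base value 1 - damping, while vertices that many others point to accumulate a higher rank.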

  • MLlib

    Machine learning library similar to Mahout

    Statistics, regression, decision trees, clustering, PCA, gradient descent

    Iterative algorithms much faster due to in-memory caching

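  • import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val data = sc.textFile("data.txt")

    // Each line: label, then space-separated feature values after the comma
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(
        parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble))
      )
    }

    // Train with 100 iterations
    val model = LinearRegressionWithSGD.train(parsedData, 100)

    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    // Mean squared error over the training data
    val MSE = valuesAndPreds
      .map { case (v, p) => math.pow(v - p, 2) }
      .mean()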
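
    The claim that iterative algorithms benefit from caching follows from their shape: every iteration scans the full dataset again. The sketch below shows that shape with a one-feature linear regression fitted by batch gradient descent on local data. It is a simplified illustration, not MLlib's LinearRegressionWithSGD, and the names RegressionSketch, Point, train, and mse are assumptions.

```scala
object RegressionSketch {
  // One labeled example with a single feature (a deliberate simplification).
  final case class Point(label: Double, feature: Double)

  // Fit label ≈ w * feature by batch gradient descent on mean squared error.
  def train(data: Seq[Point], iterations: Int, stepSize: Double): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) {
      // Full pass over the data: gradient of mean((w * x - y)^2) w.r.t. w.
      val gradient =
        data.map(p => 2 * (w * p.feature - p.label) * p.feature).sum / data.size
      w -= stepSize * gradient
    }
    w
  }

  // Mean squared error of the fitted weight, as computed on the slide.
  def mse(data: Seq[Point], w: Double): Double =
    data.map(p => math.pow(p.label - w * p.feature, 2)).sum / data.size
}
```

    With 100 iterations the data is traversed 100 times, which is the access pattern that makes keeping the dataset in memory worthwhile.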

  • RDDs: Resilient distributed datasets

    Familiar Scala collections API

    Distributed data and computation

    Monadic expression of transformations

    But not monads

  • Pseudo Monad

    Wraps iterator + partitions distribution

    Keeps track of history for fault tolerance

    Lazily evaluated, chaining of expressions
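
    A toy version of the "pseudo monad" idea: a wrapper that records its history, defers all work, and chains with map, flatMap, and filter. It is a sketch for illustration only, and the name LazyColl and its members are assumptions. As in Spark's RDD, flatMap here takes a function returning an ordinary collection and not another wrapper, which is one reason the slides call RDDs monad-like but not monads.

```scala
// Toy stand-in for an RDD: a description of a computation, not its result.
final class LazyColl[A](val lineage: List[String], compute: () => Iterator[A]) {
  // Each transformation extends the lineage and wraps the parent computation.
  def map[B](f: A => B): LazyColl[B] =
    new LazyColl[B](lineage :+ "map", () => compute().map(f))

  def flatMap[B](f: A => Iterable[B]): LazyColl[B] =
    new LazyColl[B](lineage :+ "flatMap", () => compute().flatMap(f))

  def filter(p: A => Boolean): LazyColl[A] =
    new LazyColl[A](lineage :+ "filter", () => compute().filter(p))

  // Action: replays the whole chain from the source on every call.
  def collect(): List[A] = compute().toList
}

object LazyColl {
  def fromSeq[A](data: Seq[A]): LazyColl[A] =
    new LazyColl[A](List("source"), () => data.iterator)
}
```

    Because collect() replays the chain from the source each time, a lost result can always be rebuilt from the recorded steps, which is the idea behind using lineage for fault tolerance.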

  • https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf


  • RDD Interface

    compute: transformation applied to iterable(s)

    getPartitions: partition data for parallel computation

    getDependencies: lineage of parent RDDs and whether a shuffle is required

  • HadoopRDD

    compute: read HDFS block or file split

    getPartitions: HDFS block or file split

    getDependencies: None

  • MappedRDD

    compute: compute parent and map result

    getPartitions: parent partition

    getDependencies: single dependency on parent

  • CoGroupedRDD

    compute: compute, shuffle then group parent RDDs

    getPartitions: one per reduce task

    getDependencies: shuffle each parent RDD
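
    The three methods on the preceding slides can be modeled in a few lines. The sketch below is a toy, not Spark's classes: MiniRDD, SourceRDD, and MappedMiniRDD are hypothetical names, partitions are plain integer indices, and fixed-size chunks of a local Seq stand in for HDFS blocks or file splits. A shuffle-based dataset such as CoGroupedRDD is left out.

```scala
// Toy model of the interface named on the slides; not Spark's real classes.
trait MiniRDD[A] {
  def getPartitions: Seq[Int]               // partition indices
  def getDependencies: Seq[MiniRDD[_]]      // parent datasets (lineage)
  def compute(partition: Int): Iterator[A]  // data of one partition

  // An "action": compute every partition and gather the results.
  def collect(): Seq[A] = getPartitions.flatMap(p => compute(p))
}

// Source dataset: no parents; each partition is one chunk of the input.
final class SourceRDD[A](data: Seq[A], chunkSize: Int) extends MiniRDD[A] {
  require(chunkSize > 0, "chunkSize must be positive")
  private val chunks = data.grouped(chunkSize).toVector

  def getPartitions: Seq[Int] = chunks.indices
  def getDependencies: Seq[MiniRDD[_]] = Nil
  def compute(partition: Int): Iterator[A] = chunks(partition).iterator
}

// Mapped dataset: same partitions as the parent, one dependency on it.
final class MappedMiniRDD[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
  def getPartitions: Seq[Int] = parent.getPartitions
  def getDependencies: Seq[MiniRDD[_]] = Seq(parent)
  def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
}
```

    Any single partition of the mapped dataset can be recomputed on its own by calling compute with its index, which only touches the matching partition of the parent.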

  • Summary

    Simple Unified API through RDDs

    Interactive Analysis

    Hadoop Integration


  • References http://www.cs.berkeley.edu/~matei/papers/2010/




    RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream


  • deanchen5@gmail.com