Spark & Spark Streaming Internals - Nov 15

  • Spark & Spark Streaming Internals

    Akhil Das (akhil@sigmoidanalytics.com)

  • Apache Spark

    Spark Stack

  • Spark Internals

  • Resilient Distributed Dataset (RDD)

    Restricted form of distributed shared memory
    Immutable
    Can only be built through deterministic transformations (textFile, map, filter, join, ...)
    Efficient fault recovery using lineage (sketched below)
    - Recompute lost partitions on failure
    - No cost if nothing fails
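
    A minimal sketch of lineage in practice (standalone local mode; the app name and data are made up): every RDD records the deterministic transformations that produced it, so a lost partition is rebuilt by replaying that lineage rather than by restoring a replica.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("LineageSketch").setMaster("local[*]"))

    val nums   = sc.parallelize(1 to 1000)     // base RDD
    val evens  = nums.filter(_ % 2 == 0)       // deterministic transformation
    val paired = evens.map(n => (n % 10, n))   // deterministic transformation

    // toDebugString prints the lineage graph Spark would replay on partition loss
    println(paired.toDebugString)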

  • RDD Operations

    Transformations
    - map / flatMap
    - filter
    - union / join / groupBy
    - cache
    - ...

    Actions
    - collect / count
    - save
    - take
    - ...
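
    A short sketch of the split (reusing the sc from the sketch above; the data is made up): transformations only build a lazy execution plan, while actions launch a job and return results to the driver.

    val words = sc.parallelize(Seq("spark", "streaming", "spark", "rdd"))

    val upper   = words.map(_.toUpperCase)          // transformation: lazy
    val sOnly   = upper.filter(_.startsWith("S"))   // transformation: lazy
    val grouped = sOnly.groupBy(_.length)           // transformation: lazy
    grouped.cache()                                 // mark for reuse; still lazy

    println(grouped.count())                        // action: runs the job
    grouped.collect().foreach(println)              // action: reuses cached data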

  • Log Mining Example

    Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")           // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    val messages = errors.map(_.split("\t")(2))        // Transformed RDD
    messages.persist()                                 // keep in memory for reuse

    messages.filter(_.contains("foo")).count()         // Action
    messages.filter(_.contains("bar")).count()         // Action

  • What is Spark Streaming?

    Framework for large-scale stream processing

    - Scales to 100s of nodes
    - Can achieve second-scale latencies
    - Provides a simple batch-like API for implementing complex algorithms
    - Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

  • Overview

    Run a streaming computation as a series of very small, deterministic batch jobs (see the sketch below)

    [Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> results in batches]

    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as an RDD and processes it using RDD operations
    - Finally, the processed results of the RDD operations are returned in batches
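
    A minimal sketch of the micro-batch model (app name, host, and port are placeholders; the batch interval is the "X seconds" above): each interval of input becomes one RDD, which Spark processes like any other RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("MicroBatchSketch")
      .setMaster("local[2]")   // one core for the receiver, one for processing

    val ssc = new StreamingContext(conf, Seconds(2))      // chop the stream into 2s batches
    val stream = ssc.socketTextStream("localhost", 9999)  // live input DStream

    stream.count().print()   // each 2s batch -> one RDD -> one count

    ssc.start()              // start receiving and scheduling batch jobs
    ssc.awaitTermination()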

  • Key Concepts

    DStream: sequence of RDDs representing a stream of data
    - Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

    Transformations: modify data from one DStream to another (see the sketch below)
    - Standard RDD operations: map, countByValue, reduce, join, ...
    - Stateful operations: window, countByValueAndWindow, ...

    Output operations: send data to an external entity
    - saveAsHadoopFiles: saves to HDFS
    - foreach: do anything with each batch of results
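
    A sketch tying the three concepts together (a standalone variant of the setup above; host, port, and checkpoint directory are placeholders, and modern Spark spells the foreach output operation foreachRDD): a standard RDD operation feeds a stateful windowed count, whose results are pushed out on every slide interval.

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("checkpoint-dir")   // windowed state requires checkpointing

    val lines = ssc.socketTextStream("localhost", 9999)   // input DStream
    val words = lines.flatMap(_.split(" "))               // standard RDD operation
    val counts =
      words.countByValueAndWindow(Seconds(30), Seconds(10))  // stateful: 30s window, 10s slide

    counts.foreachRDD { rdd =>          // output operation
      rdd.take(10).foreach(println)     // do anything with each batch of results
    }

    ssc.start()
    ssc.awaitTermination()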

  • E.g.: Get hashtags from Twitter

    val tweets = ssc.twitterStream(...)   // input DStream of tweets (credentials elided)
    val hashTags = tweets.flatMap(status =>
      status.getText.split(" ").filter(_.startsWith("#")))   // Transformation
    hashTags.saveAsHadoopFiles("hdfs://...")                 // Output operation

    Yields hashtags such as: #Ebola, #India, #Mars ...

  • Thank You