Spark & Spark Streaming Internals - Nov 15

  • Spark & Spark Streaming Internals

    Akhil Das (akhil@sigmoidanalytics.com)

  • Apache Spark

    Spark Stack

  • Spark Internals

  • Resilient Distributed Dataset (RDD)

    Restricted form of distributed shared memory
    Immutable
    Can only be built through deterministic transformations (textFile, map, filter, join, ...)
    Efficient fault recovery using lineage (sketched below)
    - Recompute lost partitions on failure
    - No cost if nothing fails
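
    A minimal sketch of lineage in practice (standalone local mode; the app name and data are made up): every RDD records the deterministic transformations that produced it, so a lost partition is rebuilt by replaying that lineage rather than by restoring a replica.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("LineageSketch").setMaster("local[*]"))

    val nums   = sc.parallelize(1 to 1000)     // base RDD
    val evens  = nums.filter(_ % 2 == 0)       // deterministic transformation
    val paired = evens.map(n => (n % 10, n))   // deterministic transformation

    // toDebugString prints the lineage graph Spark would replay on partition loss
    println(paired.toDebugString)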

  • RDD Operations

    Transformations
    - map / flatMap
    - filter
    - union / join / groupBy
    - cache
    - ...

    Actions
    - collect / count
    - save
    - take
    - ...
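
    A short sketch of the split (reusing the sc from the sketch above; the data is made up): transformations only build a lazy execution plan, while actions launch a job and return results to the driver.

    val words = sc.parallelize(Seq("spark", "streaming", "spark", "rdd"))

    val upper   = words.map(_.toUpperCase)          // transformation: lazy
    val sOnly   = upper.filter(_.startsWith("S"))   // transformation: lazy
    val grouped = sOnly.groupBy(_.length)           // transformation: lazy
    grouped.cache()                                 // mark for reuse; still lazy

    println(grouped.count())                        // action: runs the job
    grouped.collect().foreach(println)              // action: reuses cached data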

  • Log Mining Example

    Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")           // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    val messages = errors.map(_.split("\t")(2))        // Transformed RDD
    messages.persist()                                 // keep in memory for reuse

    messages.filter(_.contains("foo")).count()         // Action
    messages.filter(_.contains("bar")).count()         // Action

  • What is Spark Streaming?

    Framework for large-scale stream processing

    - Scales to 100s of nodes
    - Can achieve second-scale latencies
    - Provides a simple batch-like API for implementing complex algorithms
    - Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

  • Overview

    Run a streaming computation as a series of very small, deterministic batch jobs (see the sketch below)

    [Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> results in batches]

    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as an RDD and processes it using RDD operations
    - Finally, the processed results of the RDD operations are returned in batches
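
    A minimal sketch of the micro-batch model (app name, host, and port are placeholders; the batch interval is the "X seconds" above): each interval of input becomes one RDD, which Spark processes like any other RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("MicroBatchSketch")
      .setMaster("local[2]")   // one core for the receiver, one for processing

    val ssc = new StreamingContext(conf, Seconds(2))      // chop the stream into 2s batches
    val stream = ssc.socketTextStream("localhost", 9999)  // live input DStream

    stream.count().print()   // each 2s batch -> one RDD -> one count

    ssc.start()              // start receiving and scheduling batch jobs
    ssc.awaitTermination()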

  • Key Concepts

    DStream: sequence of RDDs representing a stream of data
    - Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

    Transformations: modify data from one DStream to another (see the sketch below)
    - Standard RDD operations: map, countByValue, reduce, join, ...
    - Stateful operations: window, countByValueAndWindow, ...

    Output operations: send data to an external entity
    - saveAsHadoopFiles: saves to HDFS
    - foreach: do anything with each batch of results
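
    A sketch tying the three concepts together (a standalone variant of the setup above; host, port, and checkpoint directory are placeholders, and modern Spark spells the foreach output operation foreachRDD): a standard RDD operation feeds a stateful windowed count, whose results are pushed out on every slide interval.

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("checkpoint-dir")   // windowed state requires checkpointing

    val lines = ssc.socketTextStream("localhost", 9999)   // input DStream
    val words = lines.flatMap(_.split(" "))               // standard RDD operation
    val counts =
      words.countByValueAndWindow(Seconds(30), Seconds(10))  // stateful: 30s window, 10s slide

    counts.foreachRDD { rdd =>          // output operation
      rdd.take(10).foreach(println)     // do anything with each batch of results
    }

    ssc.start()
    ssc.awaitTermination()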

  • E.g.: Get hashtags from Twitter

    val tweets = ssc.twitterStream(...)   // input DStream of tweets (credentials elided)
    val hashTags = tweets.flatMap(status =>
      status.getText.split(" ").filter(_.startsWith("#")))   // Transformation
    hashTags.saveAsHadoopFiles("hdfs://...")                 // Output operation

    Yields hashtags such as: #Ebola, #India, #Mars ...

  • Thank You