Spark Streaming with Cassandra

Posted on 21-May-2015



This presentation introduces the Apache Spark Streaming system and shows how it can be used with the Apache Cassandra database.

Transcript of Spark Streaming with Cassandra

Spark Streaming with C*

jacek.lewandowski@datastax.com

…applies where you need near-realtime data analysis

Spark vs Spark Streaming

Spark works on a static dataset (zillions of bytes); Spark Streaming works on a stream of data (gigabytes per second).

What can you do with it?

Sources: applications, sensors, web, mobile phones

Use cases: intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location based ads, log processing, supply chain planning, sentiment analysis, spying


Almost whatever source you want → almost whatever destination you want

so, let’s see how it works

DStream: a continuous sequence of micro batches (μBatches), each of which is an ordinary RDD.

Processing of a DStream = processing of μBatches, i.e. ordinary RDDs
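As a minimal, hedged sketch of that equivalence (the socket source and port are illustrative), each μBatch can be handled with the ordinary RDD API through foreachRDD:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo")
val ssc = new StreamingContext(conf, Seconds(1))   // 1 s μBatches

// a text stream read from a TCP socket (source is illustrative)
val lines = ssc.socketTextStream("localhost", 9999)

// foreachRDD exposes every μBatch as an ordinary RDD
lines.foreachRDD { (rdd, time) =>
  println(s"μBatch at $time contains ${rdd.count()} lines")
}

ssc.start()
ssc.awaitTermination()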

Receiver

A receiver is the interface between different stream sources and Spark. Incoming messages (9 8 7 6 5 4 3 2 1) cross the Spark memory boundary and are handed to the Block Manager, which groups them into blocks of input data, replicates them, and builds μBatches out of them. A μBatch is made of blocks, and each block becomes one partition of the μBatch's RDD.
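A hedged side note on where the blocks come from: the receiver hands data to the Block Manager at a fixed block interval, controlled by the spark.streaming.blockInterval property (200 ms by default), so batch interval divided by block interval approximates the number of partitions per μBatch. A minimal sketch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// with a 2 s batch interval and a 200 ms block interval,
// each μBatch is built from ~10 blocks, i.e. ~10 partitions per RDD
val conf = new SparkConf()
  .setAppName("block-interval-demo")
  .set("spark.streaming.blockInterval", "200")   // value in milliseconds in Spark 1.x

val ssc = new StreamingContext(conf, Seconds(2))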

Ingestion from multiple sources

Several receivers can run in parallel, each one receiving data and building blocks on its own; a new μBatch is emitted at every batch interval (2s, 1s, 0s in the illustration).
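A minimal sketch (hosts and ports are illustrative) of running several receivers in parallel and combining their output into one DStream with union:

// assuming ssc: StreamingContext is already defined
val streamA = ssc.socketTextStream("host-a", 9999)
val streamB = ssc.socketTextStream("host-b", 9999)
val streamC = ssc.socketTextStream("host-c", 9999)

// each socketTextStream starts its own receiver; union presents them as one stream
val combined = ssc.union(Seq(streamA, streamB, streamC))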

A well-worn example

• ingest text messages
• split them into separate words
• count the occurrences of words within 5-second windows
• every 5 seconds, save the word counts from the last 5 seconds to Cassandra and display the first few results on the console

how to do that?

well…

Yes, it is that easy

case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValue()

val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
    (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] =
    mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()
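For context, a hedged sketch of the setup the snippet above assumes: stream is a (key, paragraph) pair DStream (here read from Kafka; ZooKeeper quorum, group and topic names are illustrative), and saveToCassandra comes from the Spark Cassandra Connector's streaming import:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

val sparkConf = new SparkConf()
  .setAppName("streaming-word-count")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // illustrative

val ssc = new StreamingContext(sparkConf, Seconds(5))

// (key, paragraph) pairs read from Kafka; connection details are illustrative
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "word-count-group", Map("paragraphs" -> 1))

// ... the transformations from the slide above, then:
ssc.start()
ssc.awaitTermination()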

DStream stateless operators (quick recap)

• map
• flatMap
• filter
• repartition
• union
• count
• countByValue
• reduce
• reduceByKey
• joins
• cogroup
• transform
• transformWith

DStream[Bean].count()

count is applied to every 1 s μBatch separately, producing a DStream of per-μBatch counts (e.g. 4, 3, …).

DStream[Orange].union(DStream[Apple])

union merges each 1 s μBatch of one stream with the corresponding μBatch of the other.

Other stateless operations

• join(DStream[(K, W)])
• leftOuterJoin(DStream[(K, W)])
• rightOuterJoin(DStream[(K, W)])
• cogroup(DStream[(K, W)])

are applied on pairs of corresponding μBatches
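A minimal sketch (sources and key extraction are illustrative) of joining two pair DStreams; the join is computed on each pair of corresponding μBatches:

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations in Spark 1.x

// assuming ssc: StreamingContext is already defined
val clicks: DStream[(String, String)] =
  ssc.socketTextStream("host-a", 9999).map(line => (line.split(",")(0), line))
val purchases: DStream[(String, String)] =
  ssc.socketTextStream("host-b", 9999).map(line => (line.split(",")(0), line))

// μBatch n of clicks is joined with μBatch n of purchases
val joined: DStream[(String, (String, String))] = clicks.join(purchases)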

transform, transformWith

• DStream[T].transform(RDD[T] => RDD[U]): DStream[U]
• DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V]

allow you to create new stateless operators

DStream[Blue].transformWith(DStream[Red], …): DStream[Violet]

μBatches 1-A, 2-A, 3-A of the first stream are combined with the corresponding μBatches 1-B, 2-B, 3-B of the second stream, producing 1-A x 1-B, 2-A x 2-B and 3-A x 3-B.
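A minimal sketch of building such an operator with transformWith (the stream names and the subtract semantics are illustrative, not part of the slides): each μBatch of the first stream is combined with the corresponding μBatch of the second one:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// assuming allEvents and blacklisted are existing DStream[String] values
def withoutBlacklisted(allEvents: DStream[String],
                       blacklisted: DStream[String]): DStream[String] =
  allEvents.transformWith(blacklisted,
    (events: RDD[String], bad: RDD[String]) => events.subtract(bad))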

Windowing

By default: window = slide = μBatch duration.

When the window is longer than the slide, the windows overlap: in the illustrated case the resulting DStream consists of 3-second μBatches, and each resulting μBatch overlaps the preceding one by 1 second.

With a 3 s window and a 1 s slide, a μBatch appears in the output stream every 1 s, and it contains the messages collected during the last 3 s.

DStream window operators

• groupByKeyAndWindow(Duration, Duration)
• reduceByKeyAndWindow((V, V) => V, Duration, Duration)
• window(Duration, Duration)
• countByWindow(Duration, Duration)
• reduceByWindow((T, T) => T, Duration, Duration)
• countByValueAndWindow(Duration, Duration)
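A minimal sketch (the hits stream is illustrative) of reduceByKeyAndWindow, summing per-key values over a 10-second window that slides every 2 seconds:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations in Spark 1.x
import org.apache.spark.streaming.dstream.DStream

// assuming hits: DStream[(String, Long)] already exists, e.g. (url, 1L) pairs
def hitsPerWindow(hits: DStream[(String, Long)]): DStream[(String, Long)] =
  hits.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2))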

Let’s modify the example

• ingest text messages
• split them into separate words
• count the occurrences of words within 10-second windows
• every 2 seconds, save the word counts from the last 10 seconds to Cassandra and display the first few results on the console

Yes, it is still easy to do

case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValueAndWindow(Seconds(10), Seconds(2))

val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
    (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] =
    mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()

DStream stateful operator

• DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]

Example: the current μBatch contains values 1, 3, 5 for key A, values 2, 6 for key B and value 4 for key C; the previous states are 7 for A, 8 for B and 9 for C.

• R1 = f(Seq(1, 3, 5), Some(7))
• R2 = f(Seq(2, 6), Some(8))
• R3 = f(Seq(4), Some(9))

The new state becomes R1 for A, R2 for B and R3 for C.

Total word count example

case class WordCount(time: Long, word: String, count: Int)

def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
  val sum = counts.sum
  Some(state.getOrElse(0L) + sum)
}

val totalWords: DStream[(String, Long)] = stream.map { case (_, paragraph) => paragraph }
  .flatMap(_.split("""\s+"""))
  .countByValue()
  .updateStateByKey(update)

val topTotalWordCounts: DStream[WordCount] = totalWords.transform((rdd, time) =>
  rdd.map { case (word, count) =>
    (count, WordCount(time.milliseconds, word, count.toInt))
  }.sortByKey(ascending = false).values
)

topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
topTotalWordCounts.print()
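One hedged, practical note: updateStateByKey keeps state across μBatches, so Spark Streaming requires a checkpoint directory to be configured for it. A minimal sketch (the path is illustrative):

// assuming ssc: StreamingContext is already defined;
// stateful operators such as updateStateByKey need periodic checkpointing
ssc.checkpoint("hdfs:///checkpoints/streaming-word-count")   // path is illustrative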

Obtaining DStreams

• ZeroMQ
• Kinesis
• HDFS compatible file system
• Akka actor
• Twitter
• MQTT
• Kafka
• Socket
• Flume
• …

Particular DStreams are available in separate modules

GroupId            ArtifactId                         Latest Version
org.apache.spark   spark-streaming-kinesis-asl_2.10   1.1.0
org.apache.spark   spark-streaming-mqtt_2.10          1.1.0
org.apache.spark   spark-streaming-zeromq_2.10        1.1.0
org.apache.spark   spark-streaming-flume_2.10         1.1.0
org.apache.spark   spark-streaming-flume-sink_2.10    1.1.0
org.apache.spark   spark-streaming-kafka_2.10         1.1.0
org.apache.spark   spark-streaming-twitter_2.10       1.1.0
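A hedged sketch of pulling one of these modules in with sbt and creating the matching stream (the Flume host and port are illustrative):

// build.sbt (coordinates taken from the table above)
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.1.0"

// application code, assuming ssc: StreamingContext is already defined
import org.apache.spark.streaming.flume.FlumeUtils

val flumeStream = FlumeUtils.createStream(ssc, "localhost", 9988)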

If something goes wrong…

Fault tolerance

The sequence of transformations is known to Spark Streaming

μBatches are replicated once they are received

Lost data can be recomputed

But there are pitfalls

• Spark replicates blocks, not single messages

• It is up to a particular receiver to decide whether to form the block from a single message or to collect more messages before pushing the block

• The data collected in the receiver before the block is pushed will be lost in case of failure of the receiver

• The typical tradeoff: efficiency vs fault tolerance

Built-in receivers breakdown

Pushing single messages   Can do both   Pushing whole blocks
Kafka                     Akka          RawNetworkReceiver
Twitter                   Custom        ZeroMQ
Socket
MQTT
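Since a custom receiver can do both, here is a minimal, hedged sketch of one (class name and source are illustrative): calling store with a single item lets Spark group items into blocks, while calling store with a whole buffer pushes a ready-made block:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// illustrative custom receiver producing lines from some external source
class MyLineReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("MyLineReceiver") { override def run(): Unit = receive() }.start()
  }

  def onStop(): Unit = ()   // nothing to clean up in this sketch

  private def receive(): Unit = {
    while (!isStopped()) {
      val line = readLineFromSource()   // illustrative helper, not a Spark API
      store(line)                       // single message: Spark groups messages into blocks

      // alternatively, push a whole block at once:
      // store(ArrayBuffer(line, line, line))
    }
  }

  private def readLineFromSource(): String = "..."   // placeholder for the real source
}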

Thank you!

Questions?!

http://spark.apache.org/
https://github.com/datastax/spark-cassandra-connector
http://cassandra.apache.org/
http://www.datastax.com/