Spark Streaming with Cassandra


Transcript of Spark Streaming with Cassandra

Page 1: Spark Streaming with Cassandra

spark streaming with C*

[email protected]

Page 2: Spark Streaming with Cassandra

…applies where you need near-realtime data analysis

Page 3: Spark Streaming with Cassandra

Spark vs Spark Streaming

Spark: a static dataset (zillions of bytes)

Spark Streaming: a stream of data (gigabytes per second)

Page 4: Spark Streaming with Cassandra

What can you do with it?

Data sources: applications, sensors, web, mobile phones

Use cases: intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location-based ads, log processing, supply chain planning, sentiment analysis, spying

Page 5: Spark Streaming with Cassandra


Page 6: Spark Streaming with Cassandra

Almost whatever source you want → almost whatever destination you want

Page 7: Spark Streaming with Cassandra
Page 8: Spark Streaming with Cassandra
Page 9: Spark Streaming with Cassandra

so, let’s see how it works

Page 10: Spark Streaming with Cassandra

DStream: a continuous sequence of micro batches (μBatches), each of which is an ordinary RDD.

Processing a DStream = processing its μBatches, i.e. ordinary RDDs.
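As a concrete starting point, here is a minimal sketch (not from the slides) of creating a StreamingContext whose DStreams are cut into 1-second μBatches; the master URL, host and port are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A StreamingContext cuts incoming data into μBatches of the given duration;
// every μBatch is an ordinary RDD.
val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Example source (assumed here): a text stream read from a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)

ssc.start()
ssc.awaitTermination()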

Page 11: Spark Streaming with Cassandra

Receiver: the interface between different stream sources and Spark (diagram: incoming messages 9…1 arriving at the Receiver)

Page 12: Spark Streaming with Cassandra

Receiver: the interface between different stream sources and Spark. Received data crosses the Spark memory boundary into the Block Manager (diagram: incoming messages 9…1)

Page 13: Spark Streaming with Cassandra

Receiver: the interface between different stream sources and Spark. Inside the Spark memory boundary, the Block Manager takes care of replication and building μBatches (diagram: incoming messages 9…1)

Page 14: Spark Streaming with Cassandra

The Block Manager, inside the Spark memory boundary

Page 15: Spark Streaming with Cassandra

The Block Manager, inside the Spark memory boundary, stores blocks of input data (diagram: messages 9…1 grouped into blocks)

Page 16: Spark Streaming with Cassandra

The Block Manager, inside the Spark memory boundary, stores blocks of input data; a μBatch is made of those blocks (diagram: messages 9…1 grouped into blocks that form one μBatch)

Page 17: Spark Streaming with Cassandra

A μBatch made of blocks (diagram: messages 9…1 grouped into blocks)

Page 18: Spark Streaming with Cassandra

A μBatch made of blocks: each block becomes one partition of the μBatch's RDD (diagram: messages 9…1 grouped into blocks labelled Partition, Partition, Partition)

Page 19: Spark Streaming with Cassandra


Page 20: Spark Streaming with Cassandra

Ingestion from multiple sources: several receivers run in parallel, each one receiving data and building μBatches

Page 21: Spark Streaming with Cassandra

Ingestion from multiple sources: several receivers run in parallel, each one receiving data and building μBatches; on every batch interval the collected blocks form a μBatch (diagram: timeline 2s, 1s, 0s with one μBatch per interval)
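A pattern that often goes with multiple receivers (a sketch, not shown on the slide): create one input DStream per receiver and union them, so downstream processing sees a single DStream. Host names and ports are made up.

import org.apache.spark.streaming.dstream.DStream

// Sketch: three socket receivers ingest in parallel; ssc is the StreamingContext.
val streams: Seq[DStream[String]] = Seq(
  ssc.socketTextStream("host1", 9999),
  ssc.socketTextStream("host2", 9999),
  ssc.socketTextStream("host3", 9999)
)

// Union them into one logical stream before applying transformations.
val unified: DStream[String] = ssc.union(streams)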

Page 22: Spark Streaming with Cassandra

A well-worn example

• ingestion of text messages
• splitting them into separate words
• counting the occurrences of words within 5-second windows
• saving the word counts from the last 5 seconds to Cassandra every 5 seconds, and displaying the first few results on the console

Page 23: Spark Streaming with Cassandra

how to do that ?

well…

Page 24: Spark Streaming with Cassandra

Yes, it is that easy

case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValue()

val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) => (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
})

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()
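A note on the snippet above: saveToCassandra on a DStream comes from the DataStax spark-cassandra-connector, so an import like the one below is needed; the CQL table layout in the comment is an assumption that simply mirrors the WordCount fields.

// Brings saveToCassandra (and friends) onto DStreams:
import com.datastax.spark.connector.streaming._

// Assumed table layout matching WordCount(time, word, count), e.g.:
//   CREATE TABLE meetup.word_counts (
//     word text, time bigint, count int,
//     PRIMARY KEY (word, time)
//   );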

Page 25: Spark Streaming with Cassandra

DStream stateless operators (quick recap)

• map
• flatMap
• filter
• repartition
• union
• count
• countByValue
• reduce
• reduceByKey
• joins
• cogroup
• transform
• transformWith

Page 26: Spark Streaming with Cassandra

DStream[Bean].count()

(diagram: each 1s μBatch of Beans is reduced to a single-element μBatch holding its count, e.g. 4 and 3)

Page 27: Spark Streaming with Cassandra


Page 28: Spark Streaming with Cassandra

DStream[Orange].union(DStream[Apple])

(diagram: the corresponding 1s μBatches of the two streams are merged into one)

Page 29: Spark Streaming with Cassandra

Other stateless operations

• join(DStream[(K, W)])
• leftOuterJoin(DStream[(K, W)])
• rightOuterJoin(DStream[(K, W)])
• cogroup(DStream[(K, W)])

They are applied on pairs of corresponding μBatches.

Page 30: Spark Streaming with Cassandra

transform, transformWith

• DStream[T].transform(RDD[T] => RDD[U]): DStream[U]
• DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V]

They allow you to create new stateless operators.
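For instance, a custom stateless operator built with transformWith might join each μBatch of word counts against the matching μBatch of a blacklist stream and drop the blacklisted words. This is a sketch with invented names, not something from the slides:

import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Sketch: keep only the (word, count) pairs whose word is absent from the
// corresponding μBatch of the blacklist stream.
def filterBlacklisted(counts: DStream[(String, Long)],
                      blacklist: DStream[(String, Unit)]): DStream[(String, Long)] =
  counts.transformWith(blacklist, (countsRdd: RDD[(String, Long)], blRdd: RDD[(String, Unit)]) =>
    countsRdd.leftOuterJoin(blRdd).collect {
      case (word, (count, None)) => (word, count)   // no blacklist entry, keep it
    }
  )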

Page 31: Spark Streaming with Cassandra

DStream[Blue].transformWith(DStream[Red], …): DStream[Violet]

(diagram: the corresponding μBatches of the two streams are combined pairwise: 1-A x 1-B, 2-A x 2-B, 3-A x 3-B)

Page 32: Spark Streaming with Cassandra


Page 33: Spark Streaming with Cassandra


Page 34: Spark Streaming with Cassandra

Windowing

By default: window = slide = μBatch duration (diagram: window and slide markers over a 0s…7s timeline)

Page 35: Spark Streaming with Cassandra


Page 36: Spark Streaming with Cassandra


Page 37: Spark Streaming with Cassandra

Windowing

The resulting DStream consists of 3-second μBatches!

Each resulting μBatch starts 1 second (one slide) after the preceding one, so consecutive μBatches overlap.

(diagram: a 3-second window sliding by 1 second along a 0s…7s timeline)

Page 38: Spark Streaming with Cassandra


Page 39: Spark Streaming with Cassandra


Page 40: Spark Streaming with Cassandra

Windowing

(diagram: messages 1…8 on the timeline; successive windows contain 1 2 3 4 5 6 and 3 4 5 6 7 8)

A μBatch appears in the output stream every 1 s (the slide)! It contains the messages collected during the last 3 s (the window).

Page 41: Spark Streaming with Cassandra


Page 42: Spark Streaming with Cassandra

DStream window operators

• window(Duration, Duration)
• countByWindow(Duration, Duration)
• reduceByWindow(Duration, Duration, (T, T) => T)
• countByValueAndWindow(Duration, Duration)
• groupByKeyAndWindow(Duration, Duration)
• reduceByKeyAndWindow((V, V) => V, Duration, Duration)
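As a sketch of one of these operators in use (not from the slides): per-word counts over a 10-second window sliding every 2 seconds, matching the modified example on the next slide. `words` is assumed to be the DStream[String] from the earlier snippet.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (Spark 1.x)

// Sketch: count words over the last 10 seconds, recomputed every 2 seconds.
val wordPairs: DStream[(String, Long)] = words.map(word => (word, 1L))
val windowedCounts: DStream[(String, Long)] =
  wordPairs.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2))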

Page 43: Spark Streaming with Cassandra

Let’s modify the example

• ingestion of text messages
• splitting them into separate words
• counting the occurrences of words within 10-second windows
• saving the word counts from the last 10 seconds to Cassandra every 2 seconds, and displaying the first few results on the console

Page 44: Spark Streaming with Cassandra

Yes, it is still easy to do

case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] =
  words.countByValueAndWindow(Seconds(10), Seconds(2))

val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) => (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
})

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()

Page 45: Spark Streaming with Cassandra

DStream stateful operator

• DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]

(diagram: the current μBatch contains the pairs (1, A), (2, B), (3, A), (4, C), (5, A), (6, B); the previous state is 7 for key A, 8 for key B, 9 for key C)

• R1 = f(Seq(1, 3, 5), Some(7)) is the new state for key A
• R2 = f(Seq(2, 6), Some(8)) is the new state for key B
• R3 = f(Seq(4), Some(9)) is the new state for key C

Page 46: Spark Streaming with Cassandra

Total word count example

case class WordCount(time: Long, word: String, count: Int)

def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
  val sum = counts.sum
  Some(state.getOrElse(0L) + sum)
}

val totalWords: DStream[(String, Long)] =
  stream.map { case (_, paragraph) => paragraph }
    .flatMap(_.split("""\s+"""))
    .countByValue()
    .updateStateByKey(update)

val topTotalWordCounts: DStream[WordCount] = totalWords.transform((rdd, time) =>
  rdd.map {
    case (word, count) => (count, WordCount(time.milliseconds, word, count.toInt))
  }.sortByKey(ascending = false).values
)

topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
topTotalWordCounts.print()
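One thing the example leaves out: stateful operators like updateStateByKey only work when a checkpoint directory has been set on the StreamingContext, since the state RDDs must be checkpointed periodically. A minimal sketch (the path is just an example):

// Required for updateStateByKey; any HDFS-compatible path will do.
ssc.checkpoint("hdfs:///tmp/spark-streaming-checkpoint")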

Page 47: Spark Streaming with Cassandra

Obtaining DStreams

• ZeroMQ
• Kinesis
• HDFS-compatible file system
• Akka actor
• Twitter
• MQTT
• Kafka
• Socket
• Flume
• …
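For example, the key-value stream consumed in the earlier snippets can come from Kafka; a sketch with made-up ZooKeeper address, consumer group and topic (this also explains the `case (_, paragraph)` pattern, since Kafka delivers (key, message) pairs):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: one receiver consuming the "messages" topic via ZooKeeper.
val stream: DStream[(String, String)] =
  KafkaUtils.createStream(ssc, "zkhost:2181", "wordcount-group", Map("messages" -> 1))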

Page 48: Spark Streaming with Cassandra

Particular DStreams are available in separate modules

GroupId          | ArtifactId                       | Latest Version
org.apache.spark | spark-streaming-kinesis-asl_2.10 | 1.1.0
org.apache.spark | spark-streaming-mqtt_2.10        | 1.1.0
org.apache.spark | spark-streaming-zeromq_2.10      | 1.1.0
org.apache.spark | spark-streaming-flume_2.10       | 1.1.0
org.apache.spark | spark-streaming-flume-sink_2.10  | 1.1.0
org.apache.spark | spark-streaming-kafka_2.10       | 1.1.0
org.apache.spark | spark-streaming-twitter_2.10     | 1.1.0
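To pull one of these modules in, add it next to spark-streaming itself; for example, in sbt (a sketch matching the Kafka row above):

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.1.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
)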

Page 49: Spark Streaming with Cassandra

If something goes wrong…

Page 50: Spark Streaming with Cassandra

Fault tolerance

• The sequence of transformations is known to Spark Streaming
• μBatches are replicated once they are received
• Lost data can be recomputed

Page 51: Spark Streaming with Cassandra

But there are pitfalls

• Spark replicates blocks, not single messages

• It is up to a particular receiver to decide whether to form the block from a single message or to collect more messages before pushing the block

• The data collected in the receiver before the block is pushed will be lost in case of failure of the receiver

• Typical tradeoff - efficiency vs fault tolerance
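In a custom receiver this tradeoff shows up in how store() is called: pushing single messages lets Spark buffer them into blocks (anything still in that buffer dies with the receiver), while pushing a pre-collected batch hands over a whole block at once. A sketch, with a hypothetical readMessage() standing in for the real source:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of a custom receiver; readMessage() is a hypothetical blocking read.
class MySourceReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("my-source-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // Single-message store: convenient, but messages buffered by Spark and
          // not yet formed into a block are lost if the receiver fails.
          store(readMessage())
          // Alternative: collect a batch yourself and push it as one block,
          // e.g. store(batch.iterator)
        }
      }
    }.start()
  }

  def onStop(): Unit = ()

  private def readMessage(): String = ???   // stand-in for the real source
}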

Page 52: Spark Streaming with Cassandra

Built-in receivers breakdown

Pushing single messages | Can do both | Pushing whole blocks
Kafka                   | Akka        | RawNetworkReceiver
Twitter                 | Custom      | ZeroMQ
Socket                  |             |
MQTT                    |             |

Page 53: Spark Streaming with Cassandra

Thank you !

Questions?!

http://spark.apache.org/
https://github.com/datastax/spark-cassandra-connector
http://cassandra.apache.org/
http://www.datastax.com/