Spark Streaming & Kafka: The Future of Stream Processing


Transcript of Spark Streaming & Kafka: The Future of Stream Processing

Page 1: Spark Streaming & Kafka: The Future of Stream Processing

05/03/2023

Jack Gudenkauf, VP Big Data

scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x => print(x)) }.iterator).collect; println()

https://twitter.com/_JG

Page 2

PLAYTIKA
• Founded in 2010
• Social Casino global category leader
• 10 games, 13 platforms, 1000+ employees

Page 3

© Cloudera, Inc. All rights reserved.

Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)

Spark + Kafka: The Future of Stream Processing

Page 4

Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Page 5

From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves with it:

• Past: Batch Processing; Big Data = Volume + Variety; time to insight of hours
• Present: Batch + Stream Processing; Big Data = Volume + Variety + Velocity; time to insight of seconds

Page 6

Key Components of Streaming Architectures

• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security

Page 7

Canonical Stream Processing Architecture

[Diagram] Data Sources → Kafka / Flume (data ingest) → Kafka → App 1, App 2, … → HDFS / HBase

Page 8

Spark: Easy and Fast Big Data

• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory
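To illustrate the concision of the API, here is a minimal word-count sketch as it might be typed into the interactive shell (`sc` is the SparkContext the shell provides; the input path is hypothetical):

```scala
// Minimal word count in spark-shell; `sc` is provided by the shell.
val counts = sc.textFile("hdfs://.../logs")   // hypothetical input path
  .flatMap(line => line.split(" "))           // split each line into words
  .map(word => (word, 1))                     // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum counts per word across partitions
counts.take(10).foreach(println)              // action: triggers the computation
```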

Page 9

Spark Architecture

[Diagram] A Driver distributes Tasks to Workers, each holding Data in RAM; the Workers return Results to the Driver.

Page 10

RDDs

RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
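A minimal sketch of these properties, assuming a spark-shell session (`sc` provided; the path is hypothetical):

```scala
// `sc` is the SparkContext provided by spark-shell; the path is hypothetical.
val lines  = sc.textFile("hdfs://.../input")     // RDD created from stable storage
val errors = lines.filter(_.contains("ERROR"))   // transformation: a new immutable RDD
errors.cache()                                   // keep this RDD in the distributed cache
val n = errors.count()                           // action: only now is anything materialized
// If a cached partition is lost, Spark rebuilds it from the lineage
// (textFile -> filter) rather than from a replica.
```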

Page 11

Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.

The Framework Provides:
• Fault Tolerance
• Scalability
• High Throughput

Page 12

Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming

Page 13

“Micro-batch” Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

A stream is composed of small (1-10s) batch computations.

[Diagram] tweets DStream: batch @ t, batch @ t+1, batch @ t+2; each batch flows through flatMap into the hashTags DStream, then save.

Page 14

Use DStreams for Windowing Functions
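The slide itself carries no code; a minimal windowing sketch, assuming a DStream of hashtags named `hashTags` and a batch interval that evenly divides both durations:

```scala
import org.apache.spark.streaming.Seconds

// Count each hashtag over the last 60 seconds, recomputed every 10 seconds.
// Both durations must be multiples of the streaming batch interval.
val windowedCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
windowedCounts.print()
```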

Page 15

Spark Streaming

• Runs as a Spark job
  • YARN or standalone for scheduling
  • YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ…
• Easy to write “Receivers” for custom messaging systems

Page 16

Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters “ERROR” lines.

• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well

Page 17

Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform(rdd => filterErrors(rdd))
filtered.saveAsTextFiles(…)

Page 18

Reliability

• Received data automatically persisted to an HDFS Write Ahead Log (WAL) to prevent data loss
  • Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf
• When the AM dies, the application is restarted by YARN
  • Received, acked but unprocessed data (data that made it into blocks) is replayed from the WAL
• Reliable Receivers can replay data from the original source, if required
  • Un-acked data is replayed from the source
  • The Kafka and Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = no data loss on driver or receiver failure!
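A configuration sketch for the WAL setup described above (the app name and checkpoint directory are hypothetical; the WAL setting requires Spark 1.2+ and a checkpoint directory):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("reliable-streaming")                              // hypothetical app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // persist received blocks to the WAL
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://.../checkpoints")  // WAL replay and driver recovery both need a reliable checkpoint dir
```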

Page 19

Reliable Kafka DStream

• Stores received data in a Write Ahead Log on HDFS for replay – no data loss!
• Stable and supported!
• Uses a reliable receiver to pull data from Kafka
• Application-controlled parallelism: create as many receivers as you want to parallelize
• Remember – each receiver is a task and holds one executor hostage; no processing happens on that executor
• Tricky to do this efficiently, and so is controlling ordering (everything needs to be done explicitly)
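A sketch of this receiver-based pattern, assuming the spark-streaming-kafka artifact on the classpath and hypothetical ZooKeeper, group, and topic names; several receivers are created and unioned to parallelize consumption:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Map("events" -> 1)     // hypothetical topic -> consumer threads per receiver
val streams = (1 to 3).map { _ =>   // each receiver occupies one executor
  KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", topics)
}
val unified = ssc.union(streams)    // downstream processing sees a single DStream
```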

Page 20

Reliable Kafka DStream – Issues

• Kafka can replay messages if processing failed for some reason
  • So the WAL is overkill – it causes an unnecessary performance hit
  • In addition, the reliable stream causes a lot of network traffic due to unneeded HDFS writes
• Receivers hold executors hostage – which could otherwise be used for processing
• How can we solve these issues?

Page 21

Direct Kafka DStream

• No long-running receiver = no executor hogging!
• Communicates with Kafka via the “low-level” consumer API
• 1 Spark partition per Kafka partition
• At the end of every batch, each partition covers the first message after the last batch through the current latest message in the partition
• If a max rate is configured, then rate × batch interval messages are downloaded and processed
• The checkpoint contains the starting and ending offsets of the current RDD
• Recovering from a checkpoint is simple – last offset + 1 is the first offset of the next batch
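The direct API (Spark 1.3+) in sketch form, with hypothetical broker and topic names:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker list
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))  // no receiver; offsets are tracked per batch, one RDD partition per Kafka partition
```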

Page 22

Direct Kafka DStream

• (Almost) exactly-once processing
• At the end of each interval, the RDD can provide information about its starting and ending offsets
• These offsets can be persisted, so even on failure you can recover from there
• Edge cases are possible and can cause duplicates:
  • Failure in the middle of HDFS writes → duplicates!
  • Failure after processing but before offsets get persisted → duplicates! (More likely!)
• Writes to Kafka can also cause duplicates, and so can reads from Kafka
• Fix: your app should really be resilient to duplicates
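A sketch of reading those per-batch offsets via the direct API’s HasOffsetRanges, so they can be persisted atomically with the results (the store step is left abstract):

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  // A direct-stream RDD knows exactly which Kafka offsets it covers.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
  // Process the batch, then persist the output AND `ranges` in one atomic step;
  // on restart, resume each partition from its saved untilOffset.
}
```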

Page 23

Spark Streaming Use-Cases

• Real-time dashboards
  • Show approximate results in real time
  • Reconcile periodically with the source of truth using Spark
• Joins of multiple streams
  • Time-based or count-based “windows”
  • Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs

Page 24

What is coming?

• Better monitoring and alerting
  • Batch-level and task-level monitoring
• SQL on Streaming
  • Run SQL-like queries on top of Streaming (medium to long term)
• Python!
  • Limited support already available, but more detailed support coming
• ML
  • More real-time ML algorithms

Page 25

Current Spark project status

• 400+ contributors and 50+ companies contributing
  • Includes Databricks, Cloudera, Intel, Huawei, Yahoo!, etc.
• Dozens of production deployments
• Spark Streaming survived the Netflix Chaos Monkey – production ready!
• Included in CDH!

Page 26

More Info

• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• GitHub: https://github.com/apache/spark

Page 27

Thank You!
[email protected]
@harisr1234