Spark streaming with kafka

of 35 /35
Spark Streaming Kafka in Action Dori Waldman Big Data Lead

Embed Size (px)

Transcript of Spark streaming with kafka

Page 1: Spark streaming with kafka

Spark StreamingKafka in Action

Dori WaldmanBig Data Lead

Page 2: Spark streaming with kafka

Spark Streaming with Kafka – Receiver Based

Spark Streaming with Kafka – Direct (No Receiver)

Statefull Spark Streaming (Demo)


Page 3: Spark streaming with kafka

What we do … Ad-ExchangeReal time trading (150ms average response time) and optimize campaigns over ad spaces.

Tech Stack :

Page 4: Spark streaming with kafka

Why Spark ...

Page 5: Spark streaming with kafka

Use CaseTens of Millions of transactions per minute (and growing …) ~ 15TB daily (24/7 99.99999 resiliency)

Data Aggregation: (#Video Success Rate)

Real time Aggregation and DB update Raw data persistency as recovery backupRetrospective aggregation updates (recalculate)

Analytic Data :

Persist incoming events (Raw data persistency) Real time analytics and ML algorithm (inside)

Page 6: Spark streaming with kafka
Page 7: Spark streaming with kafka

Based on high-level Kafka consumer

The receiver stores Kafka messages in executors/workers

Write-Ahead Logs to recover data on failures – Recommended

ZK offsets are updated by Spark

Data duplication (WAL/Kafka)

Receiver Approach - ”KafkaUtils.createStream”

Page 8: Spark streaming with kafka

Receiver Approach - Code

Spark Partition != Kafka Partition

val kafkaStream = { …



Page 9: Spark streaming with kafka

Receiver Approach – Code (continue)

Page 10: Spark streaming with kafka

Architecture 1.0




Raw Data






Spark Batch

Spark Stream

Page 11: Spark streaming with kafka


Pros: Worked just fine with single MySQL server Simplicity – legacy code stays the same Real-time DB updates Partial Aggregation was done in Spark, DB was updated via “Insert On Duplicate Key Update”

Cons: MySQL limitations (MySQL sharding is an issue, Cassandra is optimal) S3 raw data (in standard formats) is not trivial when using Spark

Page 12: Spark streaming with kafka


Page 13: Spark streaming with kafka
Page 14: Spark streaming with kafka

Architecture 2.0




Raw Data


Stream starts from largest “offset” by default

Parquet – columnar format (FS not DB)

Spark batch update C* every few minutes (overwrite)


ConsumerRaw Data

Raw Data


Page 15: Spark streaming with kafka


Pros: Parquet is ideal for Spark analytics Backup data requires less disk space

Cons: DB is not updated in real time (streaming), we could use combination with MySQL for current hour...

What has been changed: C* uses counters for “sum/update” which is a “bad” practice(no “insert on duplicate key” using MySQL) Parquet conversion is a heavy job and it seems that streaming hourlyconversions (using batch in case of failure) is a better approach

Page 16: Spark streaming with kafka

Direct Approach – ”KafkaUtils.createDirectStream”

Based on Kafka simple consumer

Queries Kafka for the latest offsets in each topic+partition, define offset range for batch

No need to create multiple input Kafka streams and consolidate them

Spark creates an RDD partition for each Kafka partition so data is consumed in parallel

ZK offsets are not updated by Spark, offsets aretracked by Spark within its checkpoints (might notrecover)

No data duplication (no WAL)

Page 17: Spark streaming with kafka


Save metadata – needed for recovery from driver failures

RDD for statefull transformations (RDDs of previous batches)


Page 18: Spark streaming with kafka

Transfer data from driver to workersBroadcast - keep a read-only variable cached on each machine rather than shipping a copy of it with tasks

Accumulator - used to implement counters/sum, workers can only add to accumulator, driver can read its value (you can extends AccumulatorParam[Vector])

Static (Scala Object)

Context (rdd) – get data after recovery

Page 19: Spark streaming with kafka

Direct Approach - Code

Page 20: Spark streaming with kafka

def start(sparkConfig: SparkConfiguration, decoder: String) { val ssc = StreamingContext.getOrCreate(sparkCheckpointDirectory(sparkConfig),()=>functionToCreateContext(decoder,sparkConfig))

sys.ShutdownHookThread { ssc.stop(stopSparkContext = true, stopGracefully = true) }

ssc.start() ssc.awaitTermination() }

In house code

Page 21: Spark streaming with kafka

def functionToCreateContext(decoder: String,sparkConfig: SparkConfiguration ):StreamingContext = {

val sparkConf = new SparkConf().setMaster(sparkClusterHost).setAppName(sparkConfig.jobName) sparkConf.set(S3_KEY, sparkConfig.awsKey) sparkConf.set(S3_CREDS, sparkConfig.awsSecret) sparkConf.set(PARQUET_OUTPUT_DIRECTORY, sparkConfig.parquetOutputDirectory)

val sparkContext = SparkContext.getOrCreate(sparkConf)

// Hadoop S3 writer optimization sparkContext.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// Same as Avro, Parquet also supports schema evolution. This work happens in driver and takes // relativly long time sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") sparkContext.hadoopConfiguration.setInt("", 100) val ssc = new StreamingContext(sparkContext, Seconds(sparkConfig.batchTime)) ssc.checkpoint(sparkCheckpointDirectory(sparkConfig))

In house code (continue)

Page 22: Spark streaming with kafka

// evaluate stream value happened only if checkpoint folder is not exist val streams = sparkConfig.kafkaConfig.streams map { c => val topic = c.topic.split(",").toSet KafkaUtils.createDirectStream[String, String, StringDecoder, JsonDecoder](ssc, c.kafkaParams, topic) }

streams.foreach { dsStream => {

dsStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

for (o <- offsetRanges) { logInfo(s"Offset on the driver: ${offsetRanges.mkString}") } val sqlContext = SQLContext.getOrCreate(rdd.sparkContext) sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")// Data recovery after crash val s3Accesskey = rdd.context.getConf.get(S3_KEY) val s3SecretKey = rdd.context.getConf.get(S3_CREDS) val outputDirectory = rdd.context.getConf.get(PARQUET_OUTPUT_DIRECTORY)

In house code (continue)

Page 23: Spark streaming with kafka

val data = val carpetData = data.count() if (carpetData > 0) {

// coalesce(1) – Data transfer optimization during shuffle data.coalesce(1).write.mode(SaveMode.Append).partitionBy "day", "hour").parquet(“s3a//...")

// In case of S3Exception will not continue to update ZK.zk.updateNode(o.topic, o.partition.toString, kafkaConsumerGroup, o.untilOffset.toString.getBytes)

} } } } ssc }

In house code (continue)

Page 24: Spark streaming with kafka

SaveMode (Append/Overwrite) used to handle exist data (add new file / overwrite)

Spark Streaming does not update ZK (

Spark Streaming saves offset in its checkpoint folder. Once it crashes it will continue from the last offset

You can avoid using checkpoint for offsets and manage it manually


Page 25: Spark streaming with kafka

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("demo")val sparkContext = SparkContext.getOrCreate(sparkConf)val sqlContext = SQLContext.getOrCreate(sparkContext)val data ="table", "day") parquet (outputFolder)

Batch Code

Page 26: Spark streaming with kafka

Built in support for backpressure Since Spark 1.5 (default is disabled) Reciever – spark.streaming.receiver.maxRate

Direct – spark.streaming.kafka.maxRatePerPartition

Back Pressure

Page 27: Spark streaming with kafka

Links – Spark & Kafka integration

Page 28: Spark streaming with kafka

Architecture – other spark options

We can use hourly window , do the aggregation in spark and overwrite C* raw in real time …

Page 29: Spark streaming with kafka

Stateful Spark Streaming

Page 30: Spark streaming with kafka

Architecture 3.0




Raw Data




Raw Data



Raw Data

Analytic data uses spark stream to transfer Kafka raw data to Parquet.Regular Kafka consumer saves raw data backup in S3 (for streaming failure, spark batch will convert them to parquet)

Aggregation data uses statefull Spark Streaming (mapWithState) to update C*In case streaming failure spark batch will update data from Parquet to C*

Page 31: Spark streaming with kafka


Pros: Real-time DB updates

Cons: Too many components, relatively expensive (comparing to phase 1) According to documentation Spark upgrade has an issue with checkpoint

Page 32: Spark streaming with kafka

Whats Next … FiloDB ? (probably not , lots of nodes)

Parquet performance based on C*

Page 33: Spark streaming with kafka


Page 34: Spark streaming with kafka

val ssc = new StreamingContext(sparkConfig.sparkConf, Seconds(batchTime)) val kafkaStreams = (1 to sparkConfig.workers) map { i => new FixedKafkaInputDStream[String, AggregationEvent, StringDecoder, SerializedDecoder[AggregationEvent]](ssc, kafkaConfiguration.kafkaMapParams, topicMap, StorageLevel.MEMORY_ONLY_SER).map(_._2) // for write ahead log }

val unifiedStream = ssc.union(kafkaStreams) // manage all streams as one

val mapped = unifiedStream flatMap { event => Aggregations.getEventAggregationsKeysAndValues(Option(event)) // convert event to aggregation object which contains //key (“advertiserId”, “countryId”) and values (“click”, “impression”) }

val reduced = mapped.reduceByKey { _ + _ // per aggregation type we created “+” method that //describe how to do the aggregation }

K1 = advertiserId = 5countryId = 8

V1 = clicks = 2 impression = 17

k1(e), v1(e)k1(e), v2(e)

k2(e), v3(e)

k1(e), v1+v2

k2(e), v3(e)

In house Code –

Page 35: Spark streaming with kafka

Kafka messages semantics