Spark Streaming with Kafka

Spark Streaming – Kafka in Action
Dori Waldman – Big Data Lead

Spark Streaming with Kafka – Receiver Based
Spark Streaming with Kafka – Direct (No Receiver)
Stateful Spark Streaming (Demo)
Agenda

What we do … Ad-Exchange: real-time trading (150 ms average response time) and campaign optimization over ad spaces.
Tech Stack:

Why Spark ...

Use Case: Tens of millions of transactions per minute (and growing …), ~15 TB daily (24/7, 99.99999% resiliency)
Data Aggregation (#Video Success Rate):
Real-time aggregation and DB update
Raw-data persistency as recovery backup
Retrospective aggregation updates (recalculate)
Analytic Data:
Persist incoming events (raw-data persistency)
Real-time analytics and ML algorithms (inside)


Based on high-level Kafka consumer
The receiver stores Kafka messages in executors/workers
Write-Ahead Logs to recover data on failures – Recommended
ZK offsets are updated by Spark
Data duplication (WAL/Kafka)
Receiver Approach – "KafkaUtils.createStream"

Receiver Approach - Code
Spark Partition != Kafka Partition
val kafkaStream = { …
Basic
Advanced (see the sketch of both variants below)
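Roughly what such a receiver-based setup looks like: a minimal sketch, assuming a local StreamingContext, a hypothetical ZooKeeper quorum zk1:2181 and a topic named events. The basic variant uses the simple createStream overload; the advanced one creates several receivers and unions them.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("receiver-demo")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Basic: one receiver, high-level consumer, offsets tracked in ZK
val topicMap = Map("events" -> 2) // topic -> number of consumer threads
val basicStream = KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", topicMap)

// Advanced: several receivers (one createStream call each), unioned into a single DStream
val kafkaParams = Map("zookeeper.connect" -> "zk1:2181", "group.id" -> "demo-group")
val kafkaStreams = (1 to 4).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER) // serialized storage, WAL-friendly
}
val kafkaStream = ssc.union(kafkaStreams).map(_._2) // keep only the message value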

Receiver Approach – Code (continue)

Architecture 1.0
[Architecture 1.0 diagram: events flow through the stream to two consumers – Spark Stream and Spark Batch – producing aggregations and raw data]

Architecture
Pros:
Worked just fine with a single MySQL server
Simplicity – legacy code stays the same
Real-time DB updates
Partial aggregation was done in Spark, DB was updated via "Insert On Duplicate Key Update"
Cons:
MySQL limitations (MySQL sharding is an issue, Cassandra is optimal)
S3 raw data (in standard formats) is not trivial when using Spark

Monitoring


Architecture 2.0
[Architecture 2.0 diagram: events flow through the stream to consumers producing raw data and aggregations]
Stream starts from the largest "offset" by default
Parquet – columnar format (a file-system format, not a DB)
Spark batch updates C* every few minutes (overwrite)

Architecture
Pros:
Parquet is ideal for Spark analytics
Backup data requires less disk space
Cons:
DB is not updated in real time (streaming); we could use a combination with MySQL for the current hour...
What has changed:
C* uses counters for "sum/update", which is a "bad" practice (no "insert on duplicate key" as in MySQL)
Parquet conversion is a heavy job, and it seems that streaming hourly conversions (falling back to batch in case of failure) is a better approach

Direct Approach – "KafkaUtils.createDirectStream"
Based on the Kafka simple consumer
Queries Kafka for the latest offsets in each topic+partition and defines an offset range per batch
No need to create multiple input Kafka streams and consolidate them
Spark creates an RDD partition for each Kafka partition, so data is consumed in parallel
ZK offsets are not updated by Spark; offsets are tracked by Spark within its checkpoints (might not recover)
No data duplication (no WAL) – see the sketch below
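A minimal sketch of the direct approach, assuming a hypothetical broker list broker1:9092 and topic events: each Kafka partition maps 1:1 to an RDD partition, and the offset ranges consumed by each batch are visible on the driver via HasOffsetRanges.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setMaster("local[4]").setAppName("direct-demo"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092", "auto.offset.reset" -> "largest")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Offset ranges consumed by this batch (one per Kafka partition)
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}"))
  // ... process rdd, then optionally commit offsets yourself (ZK, C*, etc.)
}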

S3 / HDFS
Save metadata – needed for recovery from driver failures
RDDs for stateful transformations (RDDs of previous batches)
Checkpoint...

Transfer data from driver to workers:
Broadcast – keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
Accumulator – used to implement counters/sums; workers can only add to an accumulator, the driver can read its value (you can extend AccumulatorParam[Vector]); see the sketch below
Static (Scala object)
Context (rdd) – get data after recovery
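A minimal broadcast/accumulator sketch (Spark 1.x API); the country lookup table and counter name are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[4]").setAppName("shared-vars-demo"))

// Broadcast: read-only lookup table, cached once per executor instead of shipped with every task
val countryNames = sc.broadcast(Map(8 -> "IL", 44 -> "UK"))

// Accumulator: workers only add to it, the driver reads the final value
val badEvents = sc.accumulator(0L, "bad events")

val enriched = sc.parallelize(Seq(8, 44, 999)).map { countryId =>
  countryNames.value.getOrElse(countryId, { badEvents += 1L; "unknown" })
}
enriched.count() // forces evaluation on the workers
println(s"bad events: ${badEvents.value}") // read on the driver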

Direct Approach - Code

def start(sparkConfig: SparkConfiguration, decoder: String) {
  val ssc = StreamingContext.getOrCreate(
    sparkCheckpointDirectory(sparkConfig),
    () => functionToCreateContext(decoder, sparkConfig))

  sys.ShutdownHookThread {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }

  ssc.start()
  ssc.awaitTermination()
}
In house code

def functionToCreateContext(decoder: String, sparkConfig: SparkConfiguration): StreamingContext = {
  val sparkConf = new SparkConf().setMaster(sparkClusterHost).setAppName(sparkConfig.jobName)
  sparkConf.set(S3_KEY, sparkConfig.awsKey)
  sparkConf.set(S3_CREDS, sparkConfig.awsSecret)
  sparkConf.set(PARQUET_OUTPUT_DIRECTORY, sparkConfig.parquetOutputDirectory)

  val sparkContext = SparkContext.getOrCreate(sparkConf)

  // Hadoop S3 writer optimization
  sparkContext.hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

  // Like Avro, Parquet supports schema evolution. This work happens in the driver and takes
  // a relatively long time
  sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
  sparkContext.hadoopConfiguration.setInt("parquet.metadata.read.parallelism", 100)

  val ssc = new StreamingContext(sparkContext, Seconds(sparkConfig.batchTime))
  ssc.checkpoint(sparkCheckpointDirectory(sparkConfig))
In house code (continue)

  // This block is evaluated only if the checkpoint folder does not exist
  val streams = sparkConfig.kafkaConfig.streams map { c =>
    val topic = c.topic.split(",").toSet
    KafkaUtils.createDirectStream[String, String, StringDecoder, JsonDecoder](ssc, c.kafkaParams, topic)
  }

  streams.foreach { dsStream =>
    dsStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      for (o <- offsetRanges) {
        logInfo(s"Offset on the driver: ${offsetRanges.mkString}")
      }

      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

      // Data recovery after crash
      val s3Accesskey = rdd.context.getConf.get(S3_KEY)
      val s3SecretKey = rdd.context.getConf.get(S3_CREDS)
      val outputDirectory = rdd.context.getConf.get(PARQUET_OUTPUT_DIRECTORY)
In house code (continue)

      val data = sqlContext.read.json(rdd.map(_._2))
      val carpetData = data.count()
      if (carpetData > 0) {
        // coalesce(1) – data-transfer optimization during shuffle
        data.coalesce(1).write.mode(SaveMode.Append).partitionBy("day", "hour").parquet("s3a://...")

        // If the S3 write throws, this line is never reached and the ZK offsets are not updated
        offsetRanges.foreach { o =>
          zk.updateNode(o.topic, o.partition.toString, kafkaConsumerGroup, o.untilOffset.toString.getBytes)
        }
      }
    }
  }
  ssc
}
In house code (continue)

SaveMode (Append/Overwrite) is used to handle existing data (add a new file / overwrite)
Spark Streaming does not update ZK (http://curator.apache.org/)
Spark Streaming saves offsets in its checkpoint folder; after a crash it continues from the last offset
You can avoid using the checkpoint for offsets and manage them manually (see the sketch below)
Config...
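A minimal sketch of manual offset tracking in ZooKeeper with Apache Curator, roughly what a helper like the zk.updateNode call above might look like. The znode layout (/consumers/<group>/offsets/<topic>/<partition>) and class names are assumptions, not the actual in-house code.

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

class ZkOffsetStore(zkQuorum: String) {
  private val client = CuratorFrameworkFactory.newClient(zkQuorum, new ExponentialBackoffRetry(1000, 3))
  client.start()

  private def path(group: String, topic: String, partition: String) =
    s"/consumers/$group/offsets/$topic/$partition"

  // Persist the last processed offset for a topic/partition
  def updateNode(topic: String, partition: String, group: String, offset: Array[Byte]): Unit = {
    val p = path(group, topic, partition)
    if (client.checkExists().forPath(p) == null)
      client.create().creatingParentsIfNeeded().forPath(p, offset)
    else
      client.setData().forPath(p, offset)
  }

  // Read the stored offset (None if this partition was never committed)
  def readNode(topic: String, partition: String, group: String): Option[Long] = {
    val p = path(group, topic, partition)
    if (client.checkExists().forPath(p) == null) None
    else Some(new String(client.getData.forPath(p)).toLong)
  }
}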

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("demo")
val sparkContext = SparkContext.getOrCreate(sparkConf)
val sqlContext = SQLContext.getOrCreate(sparkContext)
val data = sqlContext.read.json(path)
data.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy("table", "day").parquet(outputFolder)
Batch Code

Built-in support for backpressure since Spark 1.5 (disabled by default); see the config sketch below
Receiver – spark.streaming.receiver.maxRate
Direct – spark.streaming.kafka.maxRatePerPartition
Back Pressure
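A minimal config sketch enabling backpressure together with the static rate limits mentioned above; the numeric values are placeholders, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("backpressure-demo")
  // Dynamically adjust the ingestion rate based on batch scheduling delays (Spark >= 1.5, off by default)
  .set("spark.streaming.backpressure.enabled", "true")
  // Static upper bounds, still useful as a safety net for the first batches
  .set("spark.streaming.receiver.maxRate", "10000")          // receiver-based: records/sec per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // direct: records/sec per Kafka partition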

https://www.youtube.com/watch?v=fXnNEq1v3VA&list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6&index=16
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
http://koeninger.github.io/kafka-exactly-once/#1
http://www.slideshare.net/miguno/being-ready-for-apache-kafka-apache-big-data-europe-2015
http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
https://dzone.com/articles/uniting-spark-parquet-and-s3-as-an-alternative-to
http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/
https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
Links – Spark & Kafka integration

Architecture – other Spark options
We can use an hourly window, do the aggregation in Spark, and overwrite the C* row in real time …

https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html
Stateful Spark Streaming

Architecture 3.0
[Architecture 3.0 diagram: events flow through the stream to consumers that produce raw data and aggregations]
Analytic data uses Spark Streaming to transfer Kafka raw data to Parquet. A regular Kafka consumer saves a raw-data backup in S3 (on streaming failure, a Spark batch job converts it to Parquet).
Aggregation data uses stateful Spark Streaming (mapWithState) to update C*. In case of streaming failure, a Spark batch job updates C* from Parquet. (A mapWithState sketch follows below.)
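A minimal mapWithState sketch (Spark 1.6 API) keeping a running count per (advertiserId, countryId) key; the key type, count semantics and downstream C* write are illustrative assumptions.

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Called once per key per batch; keeps the running total in Spark state
def updateCount(key: (Int, Int), newCount: Option[Long], state: State[Long]): ((Int, Int), Long) = {
  val total = state.getOption.getOrElse(0L) + newCount.getOrElse(0L)
  state.update(total)
  (key, total) // emitted downstream, e.g. upserted into C*
}

// impressions: DStream[((Int, Int), Long)] built from the Kafka direct stream
// val totals = impressions.mapWithState(StateSpec.function(updateCount _).timeout(Minutes(60)))
// totals.foreachRDD(rdd => /* write rdd to Cassandra */ ())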

Architecture
Pros:
Real-time DB updates
Cons:
Too many components; relatively expensive (compared to phase 1)
According to the documentation, upgrading Spark is an issue when using checkpoints

http://www.slideshare.net/planetcassandra/tuplejump-breakthrough-olap-performance-on-cassandra-and-spark?ref=http://www.planetcassandra.org/blog/introducing-filodb/
What's Next … FiloDB? (probably not – lots of nodes)
Parquet-level performance on top of C*

Questions?

val ssc = new StreamingContext(sparkConfig.sparkConf, Seconds(batchTime))

val kafkaStreams = (1 to sparkConfig.workers) map { i =>
  new FixedKafkaInputDStream[String, AggregationEvent, StringDecoder, SerializedDecoder[AggregationEvent]](
    ssc, kafkaConfiguration.kafkaMapParams, topicMap,
    StorageLevel.MEMORY_ONLY_SER // for write-ahead log
  ).map(_._2)
}

val unifiedStream = ssc.union(kafkaStreams) // manage all streams as one

val mapped = unifiedStream flatMap { event =>
  // Convert an event to aggregation objects that contain
  // keys ("advertiserId", "countryId") and values ("click", "impression")
  Aggregations.getEventAggregationsKeysAndValues(Option(event))
}

// Per aggregation type we created a "+" method that describes how to do the aggregation
// (a sketch of such a value class follows below)
val reduced = mapped.reduceByKey { _ + _ }
Example of the reduceByKey step:
  k1 = (advertiserId = 5, countryId = 8), v1 = (clicks = 2, impressions = 17)
  Input:  (k1, v1), (k1, v2), (k2, v3)
  Output: (k1, v1 + v2), (k2, v3)
In house Code –
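The "+" used in reduceByKey relies on the aggregation value type defining its own addition. A hypothetical sketch of such a value class (the real AggregationEvent and Aggregations helper are not shown in the slides):

// Hypothetical aggregation key/value classes; the real ones are project-specific
case class AggKey(advertiserId: Int, countryId: Int)

case class AggValues(clicks: Long, impressions: Long) {
  // Field-wise addition so reduceByKey(_ + _) knows how to merge two values for the same key
  def +(other: AggValues): AggValues =
    AggValues(clicks + other.clicks, impressions + other.impressions)
}

// Matches the slide example: (k1, v1), (k1, v2), (k2, v3) => (k1, v1 + v2), (k2, v3)
// val reduced = mapped.reduceByKey(_ + _)   // mapped: DStream[(AggKey, AggValues)]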

Kafka message semantics (offset)