Transcript of Webinar: How to Achieve High Throughput for Real-Time Applications with SMACK, Apache Kafka and Spark Streaming
High Throughput for Real-Time Applications with SMACK, Kafka and Spark Streaming
Ryan Knight – Solution Engineer, DataStax
@knight_cloud
Data Pipelines
[Diagram: the SMACK stack – Spark, Mesos, Akka, Cassandra and Kafka – combined into a data pipeline that organizes, processes and stores data, with multiple instances of Kafka, Spark, Akka and Cassandra scaled out across the cluster]
Move from Proactive to Predictive Analytics
• Real-time analytics of streaming data
• Common use cases – fraud detection, login analysis, web traffic analysis, marketing data
• High quality data pipeline = high quality data science
• Difficult to deal with the scale and volume of data flowing through enterprises today
Spark Streaming – Predictive Analytics at Scale
• Kafka + Spark Streaming – Ideal tools for handling massive volumes of data
• Built to scale – easy to parallelize and distribute
• Resilient and Fault Tolerant – Ensure data is not lost
How do we Scale for Load and Traffic?
Spark Streaming Micro Batches
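Spark Streaming does not process records one at a time: it slices the input into micro batches, and each interval's data becomes one RDD that is scheduled as a normal Spark job. A minimal sketch of that setup, assuming a hypothetical app name and a 5-second interval (the input streams are elided):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    // Every 5 seconds the data received in that window is closed off
    // into one micro batch (an RDD) and scheduled as a Spark job.
    val ssc = new StreamingContext(conf, Seconds(5))
    // ... create input DStreams and transformations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}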
5 Keys to Scaling Spark Streaming w/ Kafka
1 Use Event Sourcing / Append Only Data Model
2 Avoid Network Shuffles
3 Tune Spark Streaming Processing Time
4 Use Kafka Direct API
5 Size Spark Streaming Batch Sizes
Key 1: Use Event Sourcing / Append Only Data Model
Data Modeling using Event Sourcing
• Append-Only Logging
• Database of Facts
• Snapshots or Roll-Ups
• Why Delete Data any more?
• Replay Events
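As a sketch of what an append-only model can look like with the spark-cassandra-connector: every change is written as a new immutable row, and current state is rebuilt by replaying a partition in time order. The RatingEvent case class and the rating_events table are hypothetical; the movie_db keyspace appears later in the talk's code.

import com.datastax.spark.connector._
import org.apache.spark.SparkContext

// Hypothetical event type: each rating is recorded as a fact, never updated in place.
case class RatingEvent(movieId: Int, eventTime: Long, userId: Int, rating: Double)

object EventSourcingSketch {
  // Assumes spark.cassandra.connection.host is set on the SparkConf and that
  // rating_events has primary key ((movie_id), event_time).
  def appendEvents(sc: SparkContext, events: Seq[RatingEvent]): Unit = {
    // Append-only write: replaying a partition in event_time order
    // reconstructs current state; snapshots/roll-ups can live in a separate table.
    sc.parallelize(events).saveToCassandra("movie_db", "rating_events")
  }
}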
Key 2: Avoid Network Shuffles
• Common use case: joining streaming data with lookup tables
• Broadcast Joins
• joinWithCassandraTable (see the sketch below)
• Use DataFrames to leverage the Catalyst optimizer
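A minimal sketch of the joinWithCassandraTable pattern from the spark-cassandra-connector (the movies table and the DStream of movie ids are illustrative assumptions): each Spark partition fetches only its own matching rows directly from Cassandra, so the lookup table is never shuffled across the network.

import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stream of movie ids arriving from Kafka.
def enrichWithMovies(movieIds: DStream[Int]): Unit = {
  movieIds.foreachRDD { rdd =>
    // Joins on the table's partition key; rows are pulled per Spark
    // partition instead of repartitioning a full lookup table.
    val enriched = rdd
      .map(Tuple1(_))
      .joinWithCassandraTable("movie_db", "movies")
    enriched.foreach { case (Tuple1(id), row) => println(s"$id -> $row") }
  }
}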
Key 3: Tune Spark Streaming Processing Time
Tuning Spark Streaming
• Keep Processing Time < Batch Duration
• If processing falls behind, Total Delay grows unbounded = Out of Memory errors

Batch Interval Gone Wrong
• Scheduling Delay of 41 Minutes!
Setting the Right Batch Interval
[Chart: 100 Kafka/Spark partitions, maxRatePerPartition = 100k, batchInterval = 5s]
• Processing Time is consistently below our Batch Interval Time
• A good approach is to test with a conservative batch interval (e.g. 5-10 seconds) and a low data rate, as in the configuration sketch below
• If the Total Delay stays consistently under the Batch Interval, the system is stable
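A minimal sketch of those two knobs in code; the values are the ones from the chart above, and spark.streaming.kafka.maxRatePerPartition is the real Spark setting that caps how many records per second each Kafka partition contributes to a batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("batch-interval-tuning")
  // Cap each Kafka partition at 100k records/sec so a backlog cannot
  // produce one giant batch that blows past the interval.
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")

// Start with a conservative 5s interval; if Total Delay in the Streaming UI
// stays below it under load, the batch interval is sized correctly.
val ssc = new StreamingContext(conf, Seconds(5))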
Key 4: Use Kafka Direct API
Kafka High-level Review

Anatomy of a Topic
[Diagram: a topic split into Partitions 0, 1 and 2. Each partition is an ordered sequence of messages indexed by offset; writes are appended at the end, so low offsets hold old data and high offsets hold new data]
Advantages of the Kafka Direct API
• Number of partitions per Kafka topic = degree of parallelism
• Simplifies parallelism
• Efficiency – single copy of data on read
• Easier to work with
• Resiliency without copying data
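A minimal sketch of the direct stream setup using the Spark 1.x spark-streaming-kafka API shown later in the talk (the broker address and the ratings topic name are illustrative assumptions); the resulting DStream has exactly one Spark partition per Kafka partition:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def directRatingsStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  // No receivers and no second copy of the data: each Spark partition
  // reads its offset range straight from the matching Kafka partition,
  // and failed batches are replayed from Kafka itself.
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("ratings"))
}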
Key 5: Size Spark Streaming Batch Sizes
Reduce Processing Time by Increasing Parallelism
[Charts: 1 Kafka/Spark partition vs. 100 Kafka/Spark partitions, each with maxRatePerPartition = 100k and batchInterval = 5s]
Sizing Data Pipeline
• Look at the data flow for the entire pipeline
• Benchmarking is key!
• Calculate the number of messages a single Spark Streaming server can handle
• Calculate the number of messages flowing into Kafka
Sizing Spark Streaming
• Number of CPU cores is the max number of parallel tasks
• An RDD (Spark data type) is internally divided into partitions based on data set size
• A data transformation on one partition is a task
• Each CPU core can process one task
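A small sketch of that core/partition/task relationship (the core count is an illustrative assumption): one task runs per partition, so keeping the partition count near the total core count keeps every core busy without excessive scheduling overhead.

import org.apache.spark.rdd.RDD

// Hypothetical helper: widen an RDD so its partition count matches
// the cluster's total cores (one task per partition, one task per core).
def balanceToCores[T](rdd: RDD[T], totalCores: Int = 16): RDD[T] =
  if (rdd.partitions.length < totalCores) rdd.repartition(totalCores)
  else rdd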
Java Monitoring of Kafka with JConsole
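JConsole reads these numbers over JMX, and the same metric can be pulled programmatically. A sketch under stated assumptions: the broker was started with JMX_PORT=9999 exposed, and the MBean name below is the one modern Kafka brokers register (0.8-era brokers used a slightly different, quoted naming scheme):

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object KafkaJmxSketch extends App {
  val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi")
  val connector = JMXConnectorFactory.connect(url)
  val mbeans = connector.getMBeanServerConnection
  // Broker-wide incoming message rate – the input to the sizing formula below.
  val name = new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec")
  println("MessagesInPerSec (1-min rate): " + mbeans.getAttribute(name, "OneMinuteRate"))
  connector.close()
}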
Formula for Sizing Spark Streaming

Total Servers = (# of Kafka messages) / (# of messages a Streaming server can process)

Example: 100K / 20K = a minimum of 5 servers
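The same back-of-the-envelope calculation as code, using the example rates from the slide (real inputs should come from benchmarking and the JMX numbers above):

// Messages/sec flowing into Kafka (e.g. MessagesInPerSec from JMX).
val kafkaMessageRate = 100000.0
// Messages/sec one benchmarked Spark Streaming server can handle.
val perServerRate = 20000.0
// Round up: five servers at 20K/sec cover the 100K/sec inflow.
val totalServers = math.ceil(kafkaMessageRate / perServerRate).toInt // = 5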
Example Architecture: Spark at Scale
[Diagram: example architecture with a Web Service and Legacy Systems feeding data into the DataStax Enterprise Platform]
https://github.com/retroryan/sparkatscale
Akka Feeder – Simulates Messages

// Schedule SendNextLine to this actor at a fixed interval.
val feederTick = context.system.scheduler.schedule(
  Duration.Zero, tickInterval, self, SendNextLine)
...
case SendNextLine =>
  val record = new ProducerRecord[String, String](
    feederExtension.kafkaTopic, key, nxtRating.toString)
  // Send asynchronously; the Callback reports success or failure.
  val future = feederExtension.producer.send(record, new Callback { ...
Spark Streaming – Reading the Messages

val rawRatingsStream = KafkaUtils.createDirectStream ...
...
ratingsStream.foreachRDD { (message: RDD[Rating], batchTime: Time) =>
  // Convert each RDD from the batch into a Ratings DataFrame.
  val ratingDF = message.toDF()
  // Save the DataFrame to Cassandra.
  // Note: Cassandra has been initialized through dse spark-submit,
  // so we don't have to explicitly set the connection.
  ratingDF.write.format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace" -> "movie_db", "table" -> "rating_by_movie"))
    .save()
}
Coming Soon!
• June 1: Building Data Pipelines with SMACK: Storage Strategy using Cassandra and DSE
• July 6: Building Data Pipelines with SMACK: Analyzing Data with Spark
• For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars
Get your SMACK on!
Thank you!
Follow me on Twitter: @knight_cloud