Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv


Streaming Analytics on AWS
Dmitri Tchikatilov
AdTech BD,


Agenda
  • Streaming principles
  • Streaming analytics on AWS
  • Kinesis and Apache Spark on EMR
  • Querying and Scaling
  • Best Practices

Batch vs. Stream

Data scope
  Batch processing: Queries or processing over all or most of the data
  Stream processing: Queries or processing over data in a rolling window or the most recent data record

Data size
  Batch processing: Large batches of data
  Stream processing: Individual records or micro batches of a few records

Performance
  Batch processing: Latencies in minutes to hours
  Stream processing: Requires latency on the order of seconds or milliseconds

Analytics
  Batch processing: Complex analytics
  Stream processing: Simple response functions, aggregates, and rolling metrics


Streaming App Challenges

Simple & Flexible Analytics
Elastic - adapt to input surges and back pressure
Fast ~ 1s to 100ms for the majority of apps
Scalable ~ 1M records/sec
Available - low tolerance for record losses


Business feels that once the data arrives, it needs to be ready for consumption
Trade-off of usability vs. performance


We are our choices...

J.P. Sartre

Stream Processing Choices on AWS

Storm
  Operations: Zookeeper/Nimbus for HA
  Analytics: SQL - 3rd party, roll your own

Kafka
  Operations: Zookeeper (failure detection, partitioning, replication)
  Analytics: SQL - 3rd party, roll your own

Druid
  Operations: Zookeeper, multiple node roles scale independently
  Analytics: OLAP engine (JSON) on denormalized data, real-time indexing

Kinesis
  Operations: AWS service
  Analytics: SQL - Kinesis Analytics (in development)

Spark Streaming
  Operations: EMR bootstraps latest 1.6, Yarn, monitoring
  Analytics: SparkSQL on DataFrames, joins, Zeppelin notebooks

Do we need to join the data?

SQL for Storm: for Kafka: (9 contributors)

Components

Storage layer: Ingest (record storing, ordering, strong consistency and replayable reads)

Processing layer: Analytics (consume data from storage layer, run computations, removal from storage)


Real-Time Streaming Data Ingestion

Custom-built Streaming Applications (KCL)
Inexpensive: $0.014 per 1,000,000 PUT Payload Units

Storage - Amazon Kinesis Streams

Kinesis Stream
1 shard: up to 1 MB/s in, 2 MB/s out
Each record < 1 MB
PutRecords(): up to 500 records (5 MB) per call
Increased retention: 7 days
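To make the PutRecords limits concrete, here is a minimal sketch of a producer-side helper that groups payloads so each group fits in one call. The function name `chunk_for_put_records` and the constants are hypothetical, and the size check is a simplification that counts payload bytes only (the real service also counts partition keys toward the size). It does not call AWS; it only shows the batching logic.

```python
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500             # record-count limit per PutRecords call (from the slide)
MAX_BYTES_PER_CALL = 5 * 1024 * 1024   # 5 MB per PutRecords call (from the slide)
MAX_RECORD_BYTES = 1024 * 1024         # each record must stay within 1 MB


def chunk_for_put_records(payloads: Iterable[bytes]) -> Iterator[List[bytes]]:
    """Yield lists of payloads that each fit in one PutRecords call."""
    batch: List[bytes] = []
    batch_bytes = 0
    for payload in payloads:
        if len(payload) > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the 1 MB per-record limit")
        too_many = len(batch) == MAX_RECORDS_PER_CALL
        too_big = batch_bytes + len(payload) > MAX_BYTES_PER_CALL
        # Close the current batch before it would break either limit.
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(payload)
        batch_bytes += len(payload)
    if batch:
        yield batch
```

Each yielded list could then be handed to whatever client actually sends the records; that part is left out here on purpose.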


Processing - Spark Streaming


Input data streams


Results published to destinations
DStream

RDD = Resilient Distributed Dataset
DStream = Collection of RDDs

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

Spark Streaming: Long-Running Spark App

[Diagram labels: Driver Program (StreamingContext, SparkContext; Spark jobs to process received data); Worker Node (Long Task, Input stream); Worker Node (processes the data, Output Batch)]

RECEIVERS run within application executors; they collect data from the input source and save it to RDDs.
The StreamingContext in the Driver program periodically runs Spark jobs to process the data and combine it with RDDs from the previous time steps.
Batch interval: 1 second.
DStreams support TRANSFORMATIONS and OUTPUT OPERATIONS, similar to RDD ACTIONS.

Analytics - DataFrames on Streaming Data
KCL: Kinesis Client Library (helps take data off Kinesis)
Spark Streaming uses KCL: reads data from Kinesis and forms a DStream (pull mechanism)
Creates DataFrame in Spark Streaming



Remember to set how long Spark Streaming remembers data, so it is still available for your ad hoc queries
Load into static table

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Kinesis and Spark Streaming


Full Kinesis + Spark Pipeline

What About Analytics?
What operations are possible?
Filter, GroupBy, Join, Window Operations

Not all queries make sense to run on the stream.
Large joins on RDDs in DStreams can be expensive


Spark Streaming Operations on DStreams
Window Operations

As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters.

window length - The duration of the window (3 in the figure).
sliding interval - The interval at which the window operation is performed (2 in the figure).

These two parameters must be multiples of the batch interval of the source DStream (1 in the figure).

Let's illustrate the window operations with an example. Say you want to extend the earlier example by generating word counts over the last 30 seconds of data, every 10 seconds. To do this, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.
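The window semantics can be tried out without a cluster. The sketch below is a plain-Python simulation, not the Spark API: `windowed_counts` is a hypothetical helper, each inner list stands for the words received in one batch interval, and both parameters are given in batch intervals (3 and 2, as in the figure). In Spark itself the same idea is expressed with reduceByKeyAndWindow on the pairs DStream.

```python
from collections import Counter
from typing import Dict, List


def windowed_counts(batches: List[List[str]], window_length: int,
                    slide_interval: int) -> Dict[int, Counter]:
    """Return {time step: word counts over the last `window_length` batches}.

    Batch i is assumed to hold the words that arrived during batch interval i.
    A result is produced every `slide_interval` batches.
    """
    results: Dict[int, Counter] = {}
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)  # early windows are shorter
        counts: Counter = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results[end] = counts
    return results


batches = [["a"], ["a", "b"], ["b"], ["c"], ["a", "c"], ["b"]]
for step, counts in windowed_counts(batches, window_length=3, slide_interval=2).items():
    print(step, dict(counts))
```

Because the window (3) is longer than the slide (2), consecutive windows overlap by one batch, so some words are counted in two successive results.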


Query the Data in DStreams?
This is all great, but I'd like to query my data!
StreamingContext > DStream (RDDs) > DataFrame

The DataFrame is converted to a temp table and queried with SQL through HiveContext

Example: Querying DStreams with SQL

Courtesy of Amo Abeyaratne, AWS Big Data Blog

Now that you have created an Amazon Kinesis stream and developed a Python application that feeds sample data to the stream, take a look at the main component of this solution, the Spark Streaming application written using Scala code. Download the sample code for this application.

This is an Amazon Kinesis consumer application developed using the Kinesis Client Library for Spark. It captures data ingested through Amazon Kinesis into small batches of data based on the defined batch interval. Those batches (technically, RDDs) are converted to DataFrames and eventually converted to temporary tables. The application code presents such a temporary table through the HiveServer2 instance that is started during its runtime, and the temporary table name is passed as an argument to the application.

Additionally, I have also included an action such as .saveAsTable("permKinesisTable",Append) for demonstrating how to persist the batches on HDFS by appending every batch into a permanent table. This table is accessible through the Hive metastore under the name permKinesisTable. Instead of HDFS, you may also create an S3 bucket and point to the S3 location.


Setup
1. Kinesis stream with data provided by Python script
2. KCL Scala app launched as a Spark job
  • Checks the number of shards and instantiates the same number of streams
  • Receives data from Kinesis in small batches
  • Creates DataFrame, registers as temp table
  • Creates HiveContext
3. Use Hive app to query the data

Demo Querying Streams

Analytics - Choosing Where to Join Data

Storage: Join the data in a custom KCL app, denormalize, and publish to another Kinesis stream

Processing: Join the streaming data using DStreams

Amazon Kinesis + Spark on EMR

[Diagram labels: Producer 1, Producer 2, Producer N; Shard 1, Shard 2; Yarn Executor 1 (Receiver 1, KCL Worker 1, RecordProcessor 1, RecordProcessor 2); Yarn Executor 2]

Create DStream to Scale Out

    from pyspark import StorageLevel
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    # Bracketed values are placeholders to fill in for your application.
    kinesisStream = KinesisUtils.createStream(
        streamingContext, [Kinesis app name], [Kinesis stream name],
        [endpoint URL], [region name], [initial position],
        [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)

Amazon Kinesis + Spark on EMR

[Diagram labels: Producer 1, Producer 2, Producer N; Shard 1, Shard 2; Yarn Executor 1 (Receiver 1, KCL Worker 1, RecordProcessor 1); Yarn Executor 2 (Receiver 2, KCL Worker 2, RecordProcessor 2)]

Scaling Kinesis
Kinesis
  • Can accumulate data at any rate, but needs input batching for high rates of small messages to optimize cost
  • Scales inputs by splitting shards
  • Never pressures Spark: Spark and KCL pull the data
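The cost argument for input batching can be checked with a little arithmetic. The sketch below uses the price quoted earlier in the deck ($0.014 per 1,000,000 PUT Payload Units) and assumes that one payload unit covers up to 25 KB of a record, with every record billed for at least one unit; treat that unit size, and the helper names `put_cost` and `batched_put_cost`, as assumptions made for illustration.

```python
import math

PRICE_PER_MILLION_UNITS = 0.014   # USD, from the pricing slide
PAYLOAD_UNIT_BYTES = 25 * 1024    # assumed size of one PUT Payload Unit (25 KB)


def put_cost(record_bytes: int, records: int) -> float:
    """Cost in USD of putting `records` records of `record_bytes` each."""
    units_per_record = max(1, math.ceil(record_bytes / PAYLOAD_UNIT_BYTES))
    return units_per_record * records * PRICE_PER_MILLION_UNITS / 1_000_000


def batched_put_cost(message_bytes: int, messages: int, per_record: int) -> float:
    """Cost when `per_record` small messages are packed into each record."""
    records = math.ceil(messages / per_record)
    return put_cost(message_bytes * per_record, records)


unbatched = put_cost(100, 1_000_000)             # one 100-byte message per record
batched = batched_put_cost(100, 1_000_000, 250)  # 250 messages packed per record
print(unbatched, batched)
```

Under these assumptions a 100-byte message pays for a full payload unit on its own, so packing 250 of them into one record cuts the number of billed units by a factor of 250.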

Every app needs to be stable.

Scaling EMR/Spark
EMR/Spark
  • Scales by adding task nodes, which can be EC2 Spot instances
  • Yarn can be configured for dynamic resource allocation with a variable number of executors per app
      - New default for the upcoming EMR 4.4 release
      - Works well for batch but not always for streaming
  • Automatic: same number of Receivers (in case of shard split/merge operations)
  • Manual (app restart) if you need to change the number of Receivers

Dynamic allocation DOES NOT WORK WELL for streaming jobs: all the executors are running some tasks every couple of seconds, so NO single executor is idle for a long duration.

Stability in Spark Streaming

[Timeline, 0s to 8s: Tb (batch) = 4s, Tp (process) = 2s, each batch is processed in 2s]
[Timeline, 0s to 8s: Tb (batch) = 4s, Tp (process) = 5s, each batch takes 5s and a scheduling delay appears]

Stable: Tb > Tp
Unstable: Tb < Tp, the scheduling delay keeps increasing

If the previous batch is still being processed when the next batch interval ends, scheduling delay builds up and the operating condition is unstable. Spark 1.5 introduced the backpressure concept.

Batch processing times and scheduling delays are used to continuously estimate current processing rates
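The two cases on the stability slide (Tb = 4s with Tp = 2s, and Tb = 4s with Tp = 5s) can be reproduced with a small simulation. This is a simplified model in plain Python, not Spark code: it assumes a constant processing time and one batch processed at a time, and `scheduling_delays` is a hypothetical helper name.

```python
from typing import List


def scheduling_delays(batch_interval: float, processing_time: float,
                      num_batches: int) -> List[float]:
    """Return how long each batch waits before its processing starts."""
    delays: List[float] = []
    free_at = 0.0  # time at which the previous batch finishes processing
    for i in range(1, num_batches + 1):
        ready_at = i * batch_interval      # batch i is complete at the end of interval i
        start_at = max(ready_at, free_at)  # wait if the previous job is still running
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time
    return delays


print(scheduling_delays(4, 2, 4))  # stable case from the slide
print(scheduling_delays(4, 5, 4))  # unstable case from the slide
```

In the stable case every batch starts as soon as it is ready; in the unstable case each batch waits one second longer than the one before it, which is the growing scheduling delay that backpressure is meant to contain.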

As demonstrated above, at minimum you need to ensure that every executor has