Interactive Data Analysis in Spark Streaming



Ways to interact with Spark Streaming applications
https://github.com/Shasidhar/dynamic-streaming

● Shashidhar E S

● Big data consultant and trainer at datamantra.io

● www.shashidhare.com

Agenda

● Big data applications
● Streaming applications
● Categories in streaming applications
● Interactive streaming applications
● Different strategies
● Zookeeper
● Apache Curator
● Curator cache types
● Spark streaming context dynamic switch

Big data applications

● Applications in big data are typically divided according to their workloads
● Major divisions are
○ Batch applications
○ Streaming applications
● Most existing systems support both of these application types
● But a new category of applications is on the rise: interactive applications

Big data interactive applications

● Ability to manipulate data in an interactive way
● Exploratory in nature
● Combines batch and streaming data
● For development
○ Zeppelin, Jupyter Notebook
● For production
○ Batch - Datameer, Tellius, Zoomdata
○ Streaming - Stratio Engine, WSO2

Streaming Engines

● Ability to process data in real time
● The streaming process includes
○ Collecting data
○ Processing data
● Types of streaming engines
○ Real time
○ Near real time (micro batch)
● Spark supports near real time stream processing

Streaming application types/categories

● Streaming ETL processes
● Decision engines
○ Rule based
○ Machine learning based
■ Online learning
● Real time dashboards
● Root cause analysis engines
○ Multiple streams
○ Handling event times

Streaming applications in real world

● Static
○ Data scientist defines the rules
○ Admin sets up the dashboard
○ Users cannot modify the behaviour of the streaming application
● Dynamic
○ Users can add/delete/modify rules and control the decisions
○ Users can view and design charts
○ Ability to modify the behaviour of the streaming application

Generic Interactive Application Architecture

[Diagram: a Streaming data source and a Streaming Config source feed Spark Streaming, which feeds Downstream applications]

How do we make the configuration dynamic?

Spark streaming introduction

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams

Micro batch

● Spark Streaming is a fast batch processing system
● Spark Streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1s or as big as multiple hours
● Spark job creation and execution overhead is low enough that it can do all of that under a second
● The resulting stream of micro batches is called a DStream

Spark Streaming application

[Diagram: Create Streaming Context → Define input streams → Define data processing → Define data sink → Start Streaming Context, with each micro batch flowing through the defined pipeline]

● Options to change behaviour
○ Restart the context
○ Without restarting the context
■ Control the configuration data
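A minimal sketch of this lifecycle; the socket source, port and word count logic are illustrative placeholders, not from the talk:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("minimal-streaming").setMaster("local[2]")
    // 1. Create the streaming context with a 1 second micro batch interval
    val ssc = new StreamingContext(conf, Seconds(1))
    // 2. Define the input stream
    val lines = ssc.socketTextStream("localhost", 9999)
    // 3. Define the processing
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // 4. Define the sink
    counts.print()
    // 5. Start the context; the pipeline is now fixed until the context is restarted
    ssc.start()
    ssc.awaitTermination()
  }
}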

Interactive Streaming Application Strategies

Using Kafka as configuration source

[Diagram: the Streaming data source and a Config source (Kafka) both feed Spark Streaming, which feeds Downstream applications]

Using Kafka as Configuration Source

● Easy to adopt, as Kafka is the de facto streaming store
● Configuration source
○ A new stream to track the configuration changes
● Spark Streaming
○ Maintain the configuration as state in memory and apply it
○ The state needs to be checkpointed
○ Failure recovery strategies need to be taken care of
● Drawbacks
○ Hard to handle deletes/updates in state
○ Tricky to handle state if configurations are complex
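One way this pattern can look, sketched under several assumptions: a dedicated "config" topic, plain string rules, and an existing ssc and dataStream from an application like the one above. The config stream folds updates into driver-side state that every subsequent micro batch picks up:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "config-tracker",
  "auto.offset.reset" -> "latest")

// Driver-side state holding the latest rules
@volatile var rules = Set.empty[String]

val configStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("config"), kafkaParams))

// Fold each micro batch of config records into the in-memory state.
// Note this only accumulates: handling deletes/updates cleanly is exactly
// the drawback called out above.
configStream.foreachRDD { rdd =>
  rules ++= rdd.map(_.value()).collect()
}

// transform runs on the driver once per batch, so each batch captures
// whatever rules are current at that moment
val filtered = dataStream.transform { rdd =>
  val current = rules
  rdd.filter(line => current.exists(line.contains))
}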

Using Database as configuration source

[Diagram: the Streaming data source feeds Spark Streaming, whose workers read configuration from a Distributed Database; results go to Downstream applications]

Interactive streaming application strategies contd.

● Easy to start with databases, as people are familiar with them
● Configuration source
○ A distributed data store
● Spark Streaming
○ Read the configuration from the database and apply it - polling
○ The database needs to be consistent and fast
○ Configurations can be kept in a cache to avoid latencies (see the sketch below)
● Drawbacks
○ Achieving distributed cache consistency is tricky
○ May be an extra component if you run it only for this purpose
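A sketch of the polling pattern with a TTL-based cache; the JDBC URL, the stream_rules table and the 30 second TTL are all assumptions for illustration:

import java.sql.DriverManager

object ConfigCache {
  private val ttlMs = 30000L
  @volatile private var rules: Seq[String] = Seq.empty
  @volatile private var lastRefresh = 0L

  def current(): Seq[String] = {
    val now = System.currentTimeMillis()
    if (now - lastRefresh > ttlMs) synchronized {
      if (now - lastRefresh > ttlMs) {  // re-check inside the lock
        val conn = DriverManager.getConnection("jdbc:postgresql://confdb:5432/conf")
        try {
          val rs = conn.createStatement().executeQuery("SELECT rule FROM stream_rules")
          val buf = scala.collection.mutable.Buffer[String]()
          while (rs.next()) buf += rs.getString("rule")
          rules = buf.toList
          lastRefresh = now
        } finally conn.close()
      }
    }
    rules
  }
}

If each worker keeps its own copy of such a cache, different workers can briefly see different rules between refreshes, which is the cache consistency drawback listed above.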

Using Zookeeper as configuration source

[Diagram: the Streaming data source feeds Spark Streaming, which reads its configuration from Zookeeper; results go to Downstream applications]

Interactive streaming application Strategies contd.

● Readily available if Kafka is used in the system, so no extra burden
● Configuration source - Zookeeper
● Spark Streaming
○ Ability to track configuration changes and take action - async callbacks
○ Suitable for storing any type of configuration
○ Allows listeners to be attached for configuration changes
○ Ensures cache consistency by default
● Drawbacks
○ A streaming context restart is not suitable for all systems

Apache Zookeeper

“Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers”

● Distributed coordination service
● Hierarchical file system
● Data is stored in ZNodes
● Can be thought of as a "distributed in-memory file system" with some limitations, such as data size, and optimized for high reads and low writes

Zookeeper Architecture

Zookeeper data model

● Follows a hierarchical namespace
● Each node is called a ZNode
○ Data is saved as bytes
○ Can have children
○ Only accessible through absolute paths
○ Data size limited to 1MB
● Follows global ordering

ZNode

● Types
○ Persistent nodes
■ Exist till explicitly deleted
■ Can have children
○ Ephemeral nodes
■ Exist as long as the session is active
■ Cannot have children
● Data can be secured at the ZNode level with ACLs
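A short sketch with the raw Zookeeper client showing both node types; the paths and data are illustrative:

import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, null)

// Persistent node: survives until explicitly deleted, may have children
zk.create("/workers", Array.emptyByteArray,
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

// Ephemeral node: vanishes when this session ends, cannot have children
zk.create("/workers/w1", "alive".getBytes,
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)

zk.close()

OPEN_ACL_UNSAFE is the wide-open ACL; a real deployment would lock nodes down with proper ACLs, as the slide notes.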

Data consistency

● Reads are served from local servers
● Writes are synchronised through the leader
● Ensures sequential consistency
● Data is either read completely or the read fails
● All clients get the same result irrespective of the server they are connected to
● Updates are persisted, unless overwritten by another client

Zookeeper Watches

● Available watch events in Zookeeper
○ Node Created
○ Node Deleted
○ Node Data Changed
○ Node Children Changed
● Watches are one time triggers
● The event is always received before the data
● The client can re-register the watch if it is needed again

Zookeeper Client example
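The transcript does not include the slide's code, so here is a representative sketch with the raw client; the /config path is an assumption. Note how much ceremony even a single watched read needs:

import java.util.concurrent.CountDownLatch
import org.apache.zookeeper.Watcher.Event.KeeperState
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

object RawZkClient {
  def main(args: Array[String]): Unit = {
    val connected = new CountDownLatch(1)
    // The connection handshake is asynchronous, so block on a latch
    val zk = new ZooKeeper("localhost:2181", 5000, new Watcher {
      override def process(event: WatchedEvent): Unit =
        if (event.getState == KeeperState.SyncConnected) connected.countDown()
    })
    connected.await()
    // Watches are one time triggers: the event arrives first and the data
    // must be re-read (and the watch re-registered) on every change
    val data = zk.getData("/config", new Watcher {
      override def process(event: WatchedEvent): Unit =
        println(s"Change event: ${event.getType}, re-read to get the data")
    }, null)
    println(new String(data))
    zk.close()
  }
}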

ZK API issues

● Making client code thread safe is tricky
● Hard for programmers to get right
● Exception handling is painful
● Low level, similar to the MapReduce API

Solution is “Apache Curator”

Apache Curator

● A Zookeeper keeper
● Main components
○ Client - a wrapper around the ZK class that manages the Zookeeper connection
○ Framework - a high level API that encloses all ZK related operations and handles all types of retries
○ Recipes - implementations of common Zookeeper "recipes" built on top of the Curator framework
● User friendly API

Curator Hands on - Basic Operations
Git branch: zookeeperexamples
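A sketch of the basic operations in the spirit of the zookeeperexamples branch (the branch's exact code may differ; the /demo/config path is an assumption):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

val client = CuratorFrameworkFactory.newClient(
  "localhost:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Create, read, update, delete - with retries handled by the framework
client.create().creatingParentsIfNeeded().forPath("/demo/config", "v1".getBytes)
println(new String(client.getData.forPath("/demo/config")))
client.setData().forPath("/demo/config", "v2".getBytes)
client.delete().forPath("/demo/config")

client.close()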

Apache Curator caches

Three types of caches

● Node Cache
○ Monitors a single node
● Path Cache
○ Monitors a ZNode and its children
● Tree Cache
○ Monitors an entire ZK path, caching data locally

Curator Hands on - Node Cache
Git branch: zookeeperexamples
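A NodeCache sketch under the same assumptions as above (a started client, /demo/config as the watched path):

import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}

val nodeCache = new NodeCache(client, "/demo/config")
nodeCache.getListenable.addListener(new NodeCacheListener {
  override def nodeChanged(): Unit = {
    val data = nodeCache.getCurrentData  // null if the node was deleted
    if (data != null) println(s"New config: ${new String(data.getData)}")
  }
})
nodeCache.start(true)  // build the initial cache before returning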

Path Cache

● Monitor a ZNode
● Uses Archaius - a dynamic property library
● Use a ConfigurationSource from Archaius to track changes
● Pair the configuration source with an UpdateListener
● See it in action

[Diagram: Watched DataSource → Update Listener]

Curator Hands on - Path Cache
Git branch: zookeeperlistener
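The zookeeperlistener branch wires Archaius into Curator; as a simpler stand-in for that wiring, here is a plain PathChildrenCache listener reacting to children of an assumed /demo/rules path:

import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.recipes.cache.{PathChildrenCache, PathChildrenCacheEvent, PathChildrenCacheListener}

val pathCache = new PathChildrenCache(client, "/demo/rules", true) // true = cache data
pathCache.getListenable.addListener(new PathChildrenCacheListener {
  override def childEvent(c: CuratorFramework, event: PathChildrenCacheEvent): Unit =
    event.getType match {
      case PathChildrenCacheEvent.Type.CHILD_ADDED |
           PathChildrenCacheEvent.Type.CHILD_UPDATED =>
        println(s"${event.getData.getPath} -> ${new String(event.getData.getData)}")
      case PathChildrenCacheEvent.Type.CHILD_REMOVED =>
        println(s"Removed: ${event.getData.getPath}")
      case _ => // connection state events, ignored here
    }
})
pathCache.start()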

Spark streaming dynamic restart

● Use the same WatchedSource to track any changes in configuration
● Track changes on Zookeeper with the path cache
● Control streaming context restarts on ZK data changes (sketched below)

[Diagram: Watched DataSource → Update Listener (restarts the context)]

Hands on - Streaming Restart
Git branch: streaming-listener
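A hedged sketch of the restart-on-change idea; the socket source, the filter rule stored in the znode, and the /demo/config path (assumed to exist with an initial value) are illustrations, not the exact code in the streaming-listener branch:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DynamicRestart {
  @volatile private var ssc: StreamingContext = _

  // Rebuild the pipeline around whatever rule is currently in Zookeeper
  def createContext(sc: SparkContext, rule: String): StreamingContext = {
    val ctx = new StreamingContext(sc, Seconds(1))
    ctx.socketTextStream("localhost", 9999).filter(_.contains(rule)).print()
    ctx
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dynamic-restart").setMaster("local[2]"))
    val client = CuratorFrameworkFactory.newClient(
      "localhost:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val cache = new NodeCache(client, "/demo/config")
    cache.getListenable.addListener(new NodeCacheListener {
      // Runs on a Curator event thread; production code would hand the
      // restart off to a dedicated thread. Assumes updates, not deletes.
      override def nodeChanged(): Unit = {
        val rule = new String(cache.getCurrentData.getData)
        // Stop only the streaming context, keep the SparkContext alive,
        // then rebuild and restart with the new rule
        ssc.stop(stopSparkContext = false, stopGracefully = true)
        ssc = createContext(sc, rule)
        ssc.start()
      }
    })
    cache.start(true)

    ssc = createContext(sc, new String(cache.getCurrentData.getData))
    ssc.start()
    // Keep the driver alive across restarts (awaitTermination on the old
    // context would return as soon as it is stopped)
    Thread.currentThread().join()
  }
}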

Ways to control data loss

● Enable checkpointing
● Track Kafka topic offsets manually
● Prefer direct Kafka input streams
● Use Kafka monitoring tools to see the status of data processing
● Always create the Spark streaming context from the checkpoint directory
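The last point is the standard getOrCreate pattern; a minimal sketch, with the checkpoint directory path as an assumption:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/app")  // enable checkpointing
  // ... define streams, processing and sinks here ...
  ssc
}

// Recovers the context (and offsets) from the checkpoint if one exists,
// otherwise builds a fresh one
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
ssc.start()
ssc.awaitTermination()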

Next Steps

● Try to add some meaningful configurations
● Implement the same idea with Akka actors