Interactive Data Analysis in Spark Streaming

37
Interactive Data Analysis in Spark Streaming Ways to interact with Spark Streaming applications https://github.com/Shasidhar/dynamic-streaming

Transcript of Interactive Data Analysis in Spark Streaming

Page 1: Interactive Data Analysis in Spark Streaming

Interactive Data Analysis in Spark Streaming

Ways to interact with Spark Streaming applicationshttps://github.com/Shasidhar/dynamic-streaming

Page 2: Interactive Data Analysis in Spark Streaming

● Shashidhar E S

● Big data consultant and trainer at datamantra.io

● www.shashidhare.com

Page 3: Interactive Data Analysis in Spark Streaming

Agenda

● Big data applications● Streaming applications● Categories in streaming applications● Interactive streaming applications● Different strategies● Zookeeper● Apache curator● Curator cache types● Spark streaming context dynamic switch

Page 4: Interactive Data Analysis in Spark Streaming

Big data applications

● Typically applications in big data are divided depending on their work loads

● Major divisions are○ Batch applications○ Streaming applications

● Most of the existing systems support both of these applications ● But there is a new category of applications are in raise, they are

known as interactive applications

Page 5: Interactive Data Analysis in Spark Streaming

Big data interactive applications

● Ability to manipulate data in interactive way● Exploratory in nature● Combines batch and streaming data● For development

○ Zeppelin, Jupiter Notebook● For production

○ Batch - Datameer, Tellius, Zoomdata○ Streaming - Stratio Engine, WSO2

Page 6: Interactive Data Analysis in Spark Streaming

Streaming Engines

● Ability to process data in real time● Streaming process includes

○ Collecting data ○ Processing data

● Types of streaming engines○ Real time○ Near real time - (Micro Batch)

● Spark allows near real time streaming processing

Page 7: Interactive Data Analysis in Spark Streaming

Streaming application types/categories

● Streaming ETL processes● Decision engines

○ Rule based○ Machine Learning based

■ Online learning● Real time Dashboards● Root cause analysis engines

○ Multiple Streams○ Handling event times

Page 8: Interactive Data Analysis in Spark Streaming

Streaming applications in real world

● Static○ Data scientist defines rules○ Admin sets up dashboard○ Not able to modify the behaviour of streaming application

● Dynamic○ User can add/delete/modify rules , controls the decision○ User can see some charts/ design charts ○ Ability to modify the behaviour of streaming application

Page 9: Interactive Data Analysis in Spark Streaming

Generic Interactive Application Architecture

Spark Streaming

Streaming data source

Streaming Config source

Downstream applications

How do we make the configuration dynamic?

Page 10: Interactive Data Analysis in Spark Streaming

Spark streaming introduction

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams

Page 11: Interactive Data Analysis in Spark Streaming

Micro batch

● Spark streaming is a fast batch processing system● Spark streaming collects stream data into small batchand runs batch processing on it● Batch can be as small as 1s to as big as multiple hours● Spark job creation and execution overhead is so low it

can do all that under a sec● These batches are called as DStreams

Page 12: Interactive Data Analysis in Spark Streaming

Spark Streaming application

Define Input streams

Define data Processing

Define data sync

Micro Batch

Start Streaming Context

● Options to change behaviour○ Restart context○ Without restarting context

■ Control configuration data

Create Streaming Context

Page 13: Interactive Data Analysis in Spark Streaming

Interactive Streaming Application Strategies

Page 14: Interactive Data Analysis in Spark Streaming

Using Kafka as configuration source

Spark Streaming

Streaming data source

Config source(Kafka)

Downstream applications

Page 15: Interactive Data Analysis in Spark Streaming

Using Kafka as Configuration Source

● Easy to adapt as Kafka is the defacto streaming store● Streaming configuration Source

○ New stream to track the configuration changes● Spark Streaming

○ Maintain configuration as state in memory and apply○ State needs to be checkpointed○ Failure recovery strategies need to be taken care of

● Drawbacks○ Hard to handle deletes/updates in state○ Tricky to handle state if configurations are complex

Page 16: Interactive Data Analysis in Spark Streaming

Using Database as configuration source

Spark Streaming

Streaming data source

Distributed Database

Downstream applications

Workers

Page 17: Interactive Data Analysis in Spark Streaming

Interactive streaming application Strategies contd.● Easy to start with databases, as people are familiar with it● Configuration Source

○ Distributed data source● Spark Streaming

○ Read configuration from database and apply - Polling○ Database need to be consistent and fast○ Configurations can be kept in cache to avoid latencies

● Drawbacks○ Achieving distributed cache consistency is tricky○ May be an extra component if you have it only for this purpose

Page 18: Interactive Data Analysis in Spark Streaming

Using Zookeeper as configuration source

Spark Streaming

Streaming data source

Zookeeper

Downstream applications

Page 19: Interactive Data Analysis in Spark Streaming

Interactive streaming application Strategies contd.

● It is readily available if Kafka is used in a system, no extra burden● Configuration Source - Zookeeper● Spark streaming

○ Ability to track the configuration change and take action - Async Callbacks

○ Suitable to store any type of configuration○ Allows to adapt listeners for configuration changes○ Ensures cache consistency by default

● Drawbacks○ Streaming context restart is not suitable for all systems

Page 20: Interactive Data Analysis in Spark Streaming

Apache Zookeeper

“Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers”

● Distributed Coordination service ● Hierarchical file system● Data is stored in ZNode● Can be thought as a “distributed in-memory file system” with some

limitations like size of data, optimized of high reads and low writes

Page 21: Interactive Data Analysis in Spark Streaming

Zookeeper Architecture

Page 22: Interactive Data Analysis in Spark Streaming

Zookeeper data model

● Follows hierarchical namespace● Each node is called as ZNode

○ Data saved as bytes○ Can have children○ Only accessible through absolute paths○ Data size limited to 1MB

● Follows global ordering

Page 23: Interactive Data Analysis in Spark Streaming

ZNode

● Types○ Persistent Nodes

■ Exists till explicitly deleted■ Can have children

○ Ephemeral nodes■ Exists as long as session is active■ Cannot have children

● Data can be secured at ZNode level with ACL

Page 24: Interactive Data Analysis in Spark Streaming

Data consistency

● Reads are served from local servers● Writes are synchronised through leader● Ensures Sequential Consistency● Data is either read completely or fails● All clients gets the same result irrespective of the server it is

connected● Updates are persisted, unless overridden by any client

Page 25: Interactive Data Analysis in Spark Streaming

Zookeeper Watches

● Available watches in Zookeeper○ Node Children Changed○ Node Changed○ Node Data Changed○ Node Deleted

● Watchers are one time triggers● Event is always received first rather than data● Client can re register for watch if needed again

Page 26: Interactive Data Analysis in Spark Streaming

Zookeeper Client example

Page 27: Interactive Data Analysis in Spark Streaming

ZK API issues

● Making client code thread safe is tricky● Hard for programmers● Exception handling is bad● Similar to MapReduce API

Solution is “Apache Curator”

Page 28: Interactive Data Analysis in Spark Streaming

Apache Curator

● A Zookeeper Keeper● Main components

○ Client - Wrapper for ZK class, manages Zookeeper connection○ Framework - High level API that encloses all ZK related

operations, handles all types of retries. ○ Recipes - Implementation of common Zookeeper “recipes” built of

top of curator framework● User friendly API

Page 29: Interactive Data Analysis in Spark Streaming

Curator Hands on - Basic OperationsGit branch : zookeeperexamples

Page 30: Interactive Data Analysis in Spark Streaming

Apache Curator caches

Three types of caches

● Node Cache ○ Monitors single node

● Path Cache○ Monitors a ZNode and children

● Tree Cache○ Monitors a ZK Path by caching data locally

Page 31: Interactive Data Analysis in Spark Streaming

Curator Hands on - Node Cache Git branch : zookeeperexamples

Page 32: Interactive Data Analysis in Spark Streaming

Path Cache

● Monitor a ZNode● Using archaius - a dynamic property library● Use ConfigurationSource from archaius to track changes● Pair Configuration source with UpdateListener● See in action

Watched DataSourceUpdate Listener

Page 33: Interactive Data Analysis in Spark Streaming

Curator Hands on - Path CacheGit branch : zookeeperlistener

Page 34: Interactive Data Analysis in Spark Streaming

Spark streaming dynamic restart

● Use the same WatchedSource to track any changes in configuration● Track changes on zookeeper with patch cache ● Control Streaming context restart on ZK data change

Watched DataSource

Update Listener(Restart context)

Page 35: Interactive Data Analysis in Spark Streaming

Hands on - Streaming RestartGit branch : streaming-listener

Page 36: Interactive Data Analysis in Spark Streaming

Ways to control data loss

● Enable checkpointing● Track Kafka topic offsets manually● Better to use Direct kafka input streams● Use Kafka monitoring tools to see the status of data processing● Always create spark streaming context from checkpoint directory

Next Steps● Try to add some meaningful configurations● Implement the same idea with Akka actors