Post on 16-Apr-2017
Interactive Data Analysis in Spark Streaming
Ways to interact with Spark Streaming applicationshttps://github.com/Shasidhar/dynamic-streaming
● Shashidhar E S
● Big data consultant and trainer at datamantra.io
● www.shashidhare.com
Agenda
● Big data applications● Streaming applications● Categories in streaming applications● Interactive streaming applications● Different strategies● Zookeeper● Apache curator● Curator cache types● Spark streaming context dynamic switch
Big data applications
● Typically applications in big data are divided depending on their work loads
● Major divisions are○ Batch applications○ Streaming applications
● Most of the existing systems support both of these applications ● But there is a new category of applications are in raise, they are
known as interactive applications
Big data interactive applications
● Ability to manipulate data in interactive way● Exploratory in nature● Combines batch and streaming data● For development
○ Zeppelin, Jupiter Notebook● For production
○ Batch - Datameer, Tellius, Zoomdata○ Streaming - Stratio Engine, WSO2
Streaming Engines
● Ability to process data in real time● Streaming process includes
○ Collecting data ○ Processing data
● Types of streaming engines○ Real time○ Near real time - (Micro Batch)
● Spark allows near real time streaming processing
Streaming application types/categories
● Streaming ETL processes● Decision engines
○ Rule based○ Machine Learning based
■ Online learning● Real time Dashboards● Root cause analysis engines
○ Multiple Streams○ Handling event times
Streaming applications in real world
● Static○ Data scientist defines rules○ Admin sets up dashboard○ Not able to modify the behaviour of streaming application
● Dynamic○ User can add/delete/modify rules , controls the decision○ User can see some charts/ design charts ○ Ability to modify the behaviour of streaming application
Generic Interactive Application Architecture
Spark Streaming
Streaming data source
Streaming Config source
Downstream applications
How do we make the configuration dynamic?
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark streaming is a fast batch processing system● Spark streaming collects stream data into small batchand runs batch processing on it● Batch can be as small as 1s to as big as multiple hours● Spark job creation and execution overhead is so low it
can do all that under a sec● These batches are called as DStreams
Spark Streaming application
Define Input streams
Define data Processing
Define data sync
Micro Batch
Start Streaming Context
● Options to change behaviour○ Restart context○ Without restarting context
■ Control configuration data
Create Streaming Context
Interactive Streaming Application Strategies
Using Kafka as configuration source
Spark Streaming
Streaming data source
Config source(Kafka)
Downstream applications
Using Kafka as Configuration Source
● Easy to adapt as Kafka is the defacto streaming store● Streaming configuration Source
○ New stream to track the configuration changes● Spark Streaming
○ Maintain configuration as state in memory and apply○ State needs to be checkpointed○ Failure recovery strategies need to be taken care of
● Drawbacks○ Hard to handle deletes/updates in state○ Tricky to handle state if configurations are complex
Using Database as configuration source
Spark Streaming
Streaming data source
Distributed Database
Downstream applications
Workers
Interactive streaming application Strategies contd.● Easy to start with databases, as people are familiar with it● Configuration Source
○ Distributed data source● Spark Streaming
○ Read configuration from database and apply - Polling○ Database need to be consistent and fast○ Configurations can be kept in cache to avoid latencies
● Drawbacks○ Achieving distributed cache consistency is tricky○ May be an extra component if you have it only for this purpose
Using Zookeeper as configuration source
Spark Streaming
Streaming data source
Zookeeper
Downstream applications
Interactive streaming application Strategies contd.
● It is readily available if Kafka is used in a system, no extra burden● Configuration Source - Zookeeper● Spark streaming
○ Ability to track the configuration change and take action - Async Callbacks
○ Suitable to store any type of configuration○ Allows to adapt listeners for configuration changes○ Ensures cache consistency by default
● Drawbacks○ Streaming context restart is not suitable for all systems
Apache Zookeeper
“Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers”
● Distributed Coordination service ● Hierarchical file system● Data is stored in ZNode● Can be thought as a “distributed in-memory file system” with some
limitations like size of data, optimized of high reads and low writes
Zookeeper Architecture
Zookeeper data model
● Follows hierarchical namespace● Each node is called as ZNode
○ Data saved as bytes○ Can have children○ Only accessible through absolute paths○ Data size limited to 1MB
● Follows global ordering
ZNode
● Types○ Persistent Nodes
■ Exists till explicitly deleted■ Can have children
○ Ephemeral nodes■ Exists as long as session is active■ Cannot have children
● Data can be secured at ZNode level with ACL
Data consistency
● Reads are served from local servers● Writes are synchronised through leader● Ensures Sequential Consistency● Data is either read completely or fails● All clients gets the same result irrespective of the server it is
connected● Updates are persisted, unless overridden by any client
Zookeeper Watches
● Available watches in Zookeeper○ Node Children Changed○ Node Changed○ Node Data Changed○ Node Deleted
● Watchers are one time triggers● Event is always received first rather than data● Client can re register for watch if needed again
Zookeeper Client example
ZK API issues
● Making client code thread safe is tricky● Hard for programmers● Exception handling is bad● Similar to MapReduce API
Solution is “Apache Curator”
Apache Curator
● A Zookeeper Keeper● Main components
○ Client - Wrapper for ZK class, manages Zookeeper connection○ Framework - High level API that encloses all ZK related
operations, handles all types of retries. ○ Recipes - Implementation of common Zookeeper “recipes” built of
top of curator framework● User friendly API
Curator Hands on - Basic OperationsGit branch : zookeeperexamples
Apache Curator caches
Three types of caches
● Node Cache ○ Monitors single node
● Path Cache○ Monitors a ZNode and children
● Tree Cache○ Monitors a ZK Path by caching data locally
Curator Hands on - Node Cache Git branch : zookeeperexamples
Path Cache
● Monitor a ZNode● Using archaius - a dynamic property library● Use ConfigurationSource from archaius to track changes● Pair Configuration source with UpdateListener● See in action
Watched DataSourceUpdate Listener
Curator Hands on - Path CacheGit branch : zookeeperlistener
Spark streaming dynamic restart
● Use the same WatchedSource to track any changes in configuration● Track changes on zookeeper with patch cache ● Control Streaming context restart on ZK data change
Watched DataSource
Update Listener(Restart context)
Hands on - Streaming RestartGit branch : streaming-listener
Ways to control data loss
● Enable checkpointing● Track Kafka topic offsets manually● Better to use Direct kafka input streams● Use Kafka monitoring tools to see the status of data processing● Always create spark streaming context from checkpoint directory
Next Steps● Try to add some meaningful configurations● Implement the same idea with Akka actors
References● http://sysgears.com/articles/managing-configuration-of-distributed-system-wit
h-apache-zookeeper/● https://github.com/Netflix/archaius/wiki/ZooKeeper-Dynamic-Configuration