Seattle spark-meetup-032317

Design and Implementation of Spark Streaming Connectors

Arijit Tarafdar and Nan Zhu, March 23, 2017

About Us

Arijit Tarafdar: Software Engineer @ Azure HDInsight. Works on the Spark Streaming/Structured Streaming service in Azure. Previously worked with other distributed platforms like DryadLINQ and MPI. Also worked on graph coloring algorithms, which were contributed to ADOL-C (https://projects.coin-or.org/ADOL-C).

Nan Zhu: Software Engineer @ Azure HDInsight. Works on Spark/Spark Streaming on Azure. Committee member of XGBoost @ DMLC and Apache MXNet (incubator). Spark contributor. Known as CodingCat on GitHub.

Continuous Application Architecture and Role of Spark Connectors

Not only is the size of data increasing, but also its velocity. Sensors, IoT devices, social networks, and online transactions are all generating data that needs to be monitored constantly and acted upon quickly.

Two types of datasets:
- Bounded: finite, unchanging datasets
- Unbounded: infinite datasets that are appended to continuously

Unbounded data is generated all the time, and we want to know about it now. A connector is the glue between an unbounded data source like Event Hubs and a powerful processing engine like Spark. The goal is to deliver near-real-time analysis or views.

Outline
- Recap of Spark Streaming
- Introduction to Event Hubs
- Connecting Azure Event Hubs and Spark Streaming
- Design Considerations for Spark Streaming Connector
- Contributions Back to Community
- Future Work

Spark Streaming - Background

- Micro-batching mechanism that processes continuous, infinite data sources
- A batch is scheduled at a regular time interval, or after a certain number of events has been received
- A Discretized Stream (DStream) is the highest-level abstraction over the continuous creation and expiration of RDDs
- Batch duration: a single RDD is generated per batch
- Window duration: a multiple of the batch duration; a window may span multiple RDDs
- RDDs contain partitions, with one task per partition
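
As a concrete illustration of the batch/window relationship, here is a minimal Spark Streaming sketch; the socket source, host/port, and durations are placeholders, not part of the original slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingRecap")
// One RDD is generated per 5-second batch interval.
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
// A 30-second window sliding every 10 seconds covers six batch RDDs at a time.
val windowedCount = lines.window(Seconds(30), Seconds(10)).count()
windowedCount.print()
ssc.start()
ssc.awaitTermination()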

Azure Event Hubs - Introduction

- High throughput, low latency, offered as a platform as a service (PaaS) on Azure
- No cluster setup required, no monitoring required
- Users can concentrate only on ingress and egress of data
- An Event Hubs namespace is a collection of event hubs, an event hub is a collection of partitions, and a partition is a sequential collection of events
- Up to 32 partitions per event hub, which can be increased if required
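
The namespace/hub/partition hierarchy can be pictured as a simple data model; this is a toy sketch of the structure described above, not the service's API, and all type names are illustrative:

case class Event(offset: Long, enqueuedTimeMs: Long, body: Array[Byte])
// A partition is an ordered, append-only sequence of events.
case class HubPartition(id: Int, events: Vector[Event])
// An event hub holds up to 32 partitions by default.
case class EventHub(name: String, partitions: Vector[HubPartition])
// A namespace is a collection of event hubs.
case class Namespace(name: String, hubs: Vector[EventHub])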

Azure Event Hubs - Introduction

https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs

- HTTP or AMQP with transport-level security (TLS/SSL)
- HTTP has higher message transmission overhead
- AMQP has higher connection setup overhead
- A consumer group gives a logical view of the event hub partitions, including addressing the same partition at different offsets
- Up to 20 consumer groups per event hub
- 1 receiver per consumer group

Data Flow in Event Hubs
- Proactive message delivery
- Efficient in terms of communication cost
- Data source treated as a commit log of events
- Events are read in a batch per receive() call

Each partition can be viewed as a commit log. The Event Hubs client maintains a prefetch queue to proactively fetch messages from the server. A receive() call by the application moves a batch of messages from the prefetch queue to the caller.
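
This prefetch-and-drain pattern can be sketched as follows; it is an illustrative model of the behavior just described, not the actual Event Hubs client code:

import java.util.concurrent.LinkedBlockingQueue
import scala.collection.JavaConverters._

class PrefetchingClient {
  // Prefetch queue filled proactively by a background delivery thread.
  private val prefetchQueue = new LinkedBlockingQueue[Array[Byte]](1000)

  // Called by the background thread as the server pushes events.
  def onEventArrived(event: Array[Byte]): Unit = prefetchQueue.put(event)

  // receive() drains up to batchSize already-fetched events, so the
  // application rarely waits on the network; it blocks only when empty.
  def receive(batchSize: Int): Seq[Array[Byte]] = {
    val batch = new java.util.ArrayList[Array[Byte]](batchSize)
    prefetchQueue.drainTo(batch, batchSize)
    if (batch.isEmpty) batch.add(prefetchQueue.take())
    batch.asScala.toSeq
  }
}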

Event Hubs Offset Management
- Event Hubs expects offset management to be performed on the receiver side
- Spark Streaming uses a DFS-based persistent store (HDFS, ADLS, etc.)
- The offset is stored per consumer group, per partition, per event hub, per event hubs namespace

/* An interface to read/write offset for a given Event Hubs namespace/name/partition */
@SerialVersionUID(1L)
trait OffsetStore extends Serializable {
  def open(): Unit
  def write(offset: String): Unit
  def read(): String
  def close(): Unit
}

- No support from the Event Hubs server yet
- The offset is managed by the Event Hubs connector on the Spark application side
- Uses a distributed file system like HDFS, ADLS, etc.
- The offset is stored per consumer group, per partition, per event hub, per event hubs namespace
- Event Hubs clients are initialized with an initial offset, from which Event Hubs will start sending data
- The offset is determined in one of three ways: start of stream, previously saved offset, or enqueue time
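
A minimal sketch of a DFS-backed implementation of the OffsetStore trait above, assuming a Hadoop-compatible file system (HDFS, ADLS, ...); the class name, path layout, and start-of-stream marker are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class DfsOffsetStore(checkpointDir: String, namespace: String,
                     name: String, partitionId: String) extends OffsetStore {
  // One offset file per namespace/name/partition, as described above.
  private val path = new Path(s"$checkpointDir/$namespace/$name/$partitionId")
  private var fs: FileSystem = _

  override def open(): Unit = {
    fs = path.getFileSystem(new Configuration())
  }

  override def write(offset: String): Unit = {
    val out = fs.create(path, true) // overwrite the previously saved offset
    try out.writeUTF(offset) finally out.close()
  }

  override def read(): String = {
    if (!fs.exists(path)) "-1" // no saved offset: start of stream (assumed marker)
    else {
      val in = fs.open(path)
      try in.readUTF() finally in.close()
    }
  }

  override def close(): Unit = {} // FileSystem instances are cached; nothing to release
}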

First Version: Receiver-based Spark Streaming Connector for Azure Event Hub

How do we bridge Spark Streaming and Event Hubs?

Fault Tolerance: Spark Receiver-Based Event Hubs Connector

- Reliable receivers: received data is backed up in a reliable persistent store, the write-ahead log (WAL), so no data is lost between application restarts
- Reliable receivers: the offset is saved only after the data has been written to the persistent store and pushed to the block manager
- Both executors and the driver use the WAL
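
Enabling the receiver WAL is a one-line Spark configuration; the app name and checkpoint path below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("EventHubsApp")
  // Back up received blocks to the WAL before acknowledging them.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(5))
// The WAL is written under the checkpoint directory on a DFS.
ssc.checkpoint("hdfs:///spark/checkpoints/eventhubs-app")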

Spark Streaming Recovery After Failure

On application restart, data is first processed from the WAL, up to the offset saved before the previous application stop. Receiver tasks then start the Event Hubs clients, one per partition, each with the last offset saved for that partition.
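
Driver-side recovery relies on Spark's standard checkpoint-based restart; a minimal sketch, where the checkpoint path and the stream-construction body are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints/eventhubs-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("EventHubsApp")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // ... build the Event Hubs stream and the processing graph here ...
  ssc
}

// On a clean start this calls createContext(); after a failure it rebuilds
// the context (and replays WAL data) from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()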

Event Hubs Receiver Class Signature

private[eventhubs] class EventHubsReceiver(
    eventhubsParams: Map[String, String],   // Event Hubs connection settings (namespace, name, policy, ...)
    partitionId: String,                    // the Event Hubs partition this receiver reads
    storageLevel: StorageLevel,             // how received blocks are stored (memory/disk, replication)
    offsetStore: Option[OffsetStore],       // where the last saved offset is persisted
    receiverClient: EventHubsClientWrapper, // wrapper around the Event Hubs client
    maximumEventRate: Int)                  // cap on the event ingestion rate
  extends Receiver[Array[Byte]](storageLevel) with Logging {
  ...
}

The class extends the Spark-provided Receiver class with the specific type Array[Byte], which is the exact content of the user data per event. The storage level controls whether to spill to disk when memory usage reaches capacity.

Event Hubs Receiver Methods Used/Implemented

@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {

  def onStart(): Unit

  def onStop(): Unit

  def store(dataItem: T) {
    supervisor.pushSingle(dataItem)
  }

  def store(dataBuffer: ArrayBuffer[T]) {
    supervisor.pushArrayBuffer(dataBuffer, None, None)
  }

  def restart(message: String, error: Throwable) {
    supervisor.restartReceiver(message, Some(error))
  }
}

- onStart() establishes the connections to Event Hubs
- onStop() cleans up the connections
- store() reliably stores data to the block manager
- restart() calls onStop() and then onStart()
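
Putting the lifecycle together, here is a minimal sketch of how such a receiver can drive the client; the BatchedClient trait and its methods are assumptions for illustration, not the connector's actual API:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical client with a blocking batched receive (see earlier sketch).
trait BatchedClient {
  def receive(batchSize: Int): Seq[Array[Byte]]
  def close(): Unit
}

class SketchReceiver(client: BatchedClient, storageLevel: StorageLevel)
  extends Receiver[Array[Byte]](storageLevel) {

  @volatile private var stopped = false

  override def onStart(): Unit = {
    // Spawn a thread so onStart() returns quickly, as Spark expects.
    new Thread("eventhubs-receive-loop") {
      override def run(): Unit = {
        while (!stopped && !isStopped()) {
          // Each event is pushed reliably to the block manager.
          client.receive(100).foreach(event => store(event))
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    stopped = true
    client.close()
  }
}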

Azure Event Hubs/Spark Expectations | Receiver-Based Connection
(Event Hubs) Long-running receiver / proactive message fetching | Long-running receiver tasks
(Spark) Logging data before ack | WAL / Spark checkpoint
(Event Hubs) Client-side offset management | Offset store

A Natural Fit!

Why a receiver-based connector?

Requirements in Event Hubs | Receiver-Based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements

Lessons learnt from the receiver-based connector

Requirements in Event Hubs | Receiver-Based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Logging data before ack | WAL / Spark checkpoint | Performance / data loss due to Spark bug; no easy recovery from code update

https://issues.apache.org/jira/browse/SPARK-18957

Lessons learnt from the receiver-based connector

Requirements in Event Hubs | Receiver-Based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Logging data before ack | WAL / Spark checkpoint | Performance / data loss due to Spark bug; no easy recovery from code update
Client-side offset management | Offset store | Looks fine

Lessons learnt from the receiver-based connector

Bridging Spark Streaming and Event Hubs WITHOUT a Receiver

How Does the Idea Extend to Other Data Sources (in Azure and Your IT Infrastructure)?

Extra Resources

Requirements in Event Hubs | Receiver-Based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault tolerance mechanism | WAL / Spark checkpoint | Perf. / data loss due to Spark bug; no recovery from code update
Client-side offset management | Offset store | Looks fine

From Event Hubs to General Data Sources (1): Communication Pattern
- Azure Event Hubs: long-running receiver, proactive data delivery
- Kafka: receivers started/shut down freely, passive data delivery

Most Critical Factor in Designing a Resource-Efficient Spark Streaming Connector!

Tackling the Extra Resource Requirement

[Diagram: Azure Event Hubs (EvH-Namespace-1, EventHub-1, partitions P1...PN) feed an EventHubsRDD that embeds the customized receiver logic; user-defined lambdas are applied via .map() to produce a MapPartitionsRDD, all inside the same Spark tasks]

Reduce resource requirements: compact data receiving and processing into the same task, as the sketch below shows. Inspired by the Kafka Direct DStream! More challenging here because of the different communication pattern!
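
The receiver-less approach can be sketched as an RDD whose tasks do the receiving themselves; everything below (offset ranges, the client factory, method names) is illustrative of the idea, not the connector's actual code:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical client with a blocking batched receive (as in the receiver sketch above).
trait BatchedClient { def receive(batchSize: Int): Seq[Array[Byte]]; def close(): Unit }

// Hypothetical per-partition slice of the stream for one micro-batch.
case class OffsetRange(partitionId: String, fromOffset: Long, untilOffset: Long)

private case class EventHubsRDDPartition(index: Int, range: OffsetRange) extends Partition

class SketchEventHubsRDD(sc: SparkContext, ranges: Array[OffsetRange],
                         connect: (String, Long) => BatchedClient) // assumed factory
  extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case (r, i) => EventHubsRDDPartition(i, r) }

  // The task itself connects, receives its slice, and hands events
  // straight to the user-defined lambdas chained via .map().
  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
    val range = split.asInstanceOf[EventHubsRDDPartition].range
    val client = connect(range.partitionId, range.fromOffset)
    context.addTaskCompletionListener { _: TaskContext => client.close() }
    val expected = (range.untilOffset - range.fromOffset).toInt
    client.receive(expected).iterator
  }
}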

Bridging the Spark Execution Model and the Communication Pattern Expectation

[Diagram: the same pipeline as above, with a passive message delivery layer inserted between the Azure Event Hubs partitions (EvH-Namespace-1, EventHub-1, P1...PN) and the Spark task]

- Passive message delivery layer: Recv(expectedMsgNum: Int), a blocking API
- Reconciles the long-running, proactive receiver expected by Event Hubs with the transient tasks started for each batch by Spark
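
A minimal sketch of such a blocking Recv on top of a proactively filled prefetch queue; class and method names are illustrative:

import java.util.concurrent.LinkedBlockingQueue

// The Event Hubs client keeps this queue filled proactively (long-running
// side); transient Spark tasks call recv() to pull exactly the slice of
// events assigned to their batch (passive side).
class PassiveDeliveryLayer(prefetchQueue: LinkedBlockingQueue[Array[Byte]]) {
  def recv(expectedMsgNum: Int): Seq[Array[Byte]] = {
    val buf = Vector.newBuilder[Array[Byte]]
    var received = 0
    while (received < expectedMsgNum) {
      buf += prefetchQueue.take() // blocks until the next event is available
      received += 1
    }
    buf.result()
  }
}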

Takeaways (1)

Requirements in Event Hubs | Receiver-Based Connection | Problems | Solution
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements | Compact data receiving/processing, facilitated by passive message delivery

The Communication Pattern of the Data Source Plays the Key Role in the Resource-Efficient Design of a Spark Streaming Connector

Next Problem

Fault Tolerance

Requirements in Event Hubs | Receiver-Based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault tolerance mechanism | WAL / Spark checkpoint | Perf. / data loss due to Spark bug; no recovery from code update
Client-side offset management | Offset store | Looks fine