Apache Spark and Online Analytics



Spark and Online Analytics
Sudarshan Kadambi

Copyright 2016 Bloomberg L.P. All rights reserved.

Agenda
Data and Analytics at Bloomberg
The role of Spark
The Bloomberg Spark Server
Spark for online use cases

Data and Analytics is our Business

Analytics at Bloomberg
Human-time, interactive analytics
Scalability: handle increasingly sophisticated client analytic workflows
Ad-hoc and cross-domain aggregations and filtering
Heterogeneous data stores: analytics often require data from multiple stores
Low-latency updates, in addition to queries

Spark for Bloomberg Analytics
Distributed compute scales well for large security universes and multi-universe cross-domain queries
Abstracts away heterogeneous data sources and presents a consistent interface for efficient data access
Spark as a tool for systems integration
Connectors and primitives to deal with incoming streams
Cache intermediate compute for fast queries

Spark as a Service?
Stand-alone Spark apps on isolated clusters pose challenges:

Redundancy in:

Crafting and Managing of RDDs/DFs

Coding of the same or similar types of transforms/actions

Management of clusters, replication of data, etc.

Analytics are confined to specific content sets, making cross-asset analytics much harder

Need to handle real-time ingestion in each app

[Diagram: multiple stand-alone Spark apps, each tied to its own Spark cluster, contrasted with a single shared Spark Server serving multiple Spark apps on one Spark cluster]

Bloomberg Spark Server

[Diagram: Spark Server internals: a Request Handler dispatching to Request Processors against a shared Spark Context, alongside an MDF Registry, a Function Transform Registry (FTR), RSI, and an Ingestion Manager feeding MDFs]


Spark Server: Content Caching
Data access has long-tail characteristics
High-value data sub-setted within Spark
Specified as a filter predicate at time of registration
Seamless unification of data in Spark and the backing store
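The caching scheme above can be sketched in a few lines: a filter predicate supplied at registration time selects the high-value subset to keep in memory, and misses fall through to the backing store so the two views stay unified. This is a minimal illustrative model, not Bloomberg's implementation; all class and method names here are assumptions.

```python
# Sketch of predicate-based content caching: rows matching the registration
# predicate are held in memory; everything else is served from the backing
# store. Names (CachingContentServer, get) are illustrative.

class CachingContentServer:
    def __init__(self, backing_store, predicate):
        self.backing_store = backing_store  # full dataset: key -> row
        self.predicate = predicate          # selects the "high value" subset
        # Cache only the subset that satisfies the predicate.
        self.cache = {k: v for k, v in backing_store.items() if predicate(v)}

    def get(self, key):
        if key in self.cache:                        # hot path: in-memory
            return self.cache[key], "cache"
        return self.backing_store.get(key), "store"  # long tail: backing store

store = {
    "AAPL": {"volume": 9_000_000},
    "XYZ":  {"volume": 1_000},
}
server = CachingContentServer(store, predicate=lambda row: row["volume"] > 100_000)

print(server.get("AAPL")[1])  # cache
print(server.get("XYZ")[1])   # store
```

Because misses silently fall back to the store, callers see one logical dataset regardless of where a row physically lives.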

Spark HA: State of the World
Execution lineage in Driver: recovery from lost RDDs
RDD replication: low latency, even with lost executors
Support for MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2 modes for in-memory persistence; easily extensible to more replicas if needed
Speculative execution: minimizing the performance hit from stragglers
Off-heap data: minimizing GC stalls
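The speculative-execution idea mentioned above is simple to demonstrate outside Spark: launch duplicate copies of a task and take whichever finishes first, so one straggler cannot dominate latency. A minimal thread-based sketch (not Spark's scheduler; the names are illustrative):

```python
# Speculative execution in miniature: run n duplicate copies of a task and
# return the first result, mitigating stragglers.
import concurrent.futures as cf
import time

def speculative_run(task, n_copies=2):
    """Submit duplicate copies; return the first completed result."""
    with cf.ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task, i) for i in range(n_copies)]
        done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best-effort cancel of the straggling copy
        return next(iter(done)).result()

def task(copy_id):
    # Copy 0 is a simulated straggler; copy 1 finishes quickly.
    time.sleep(2.0 if copy_id == 0 else 0.1)
    return 42

print(speculative_run(task))  # first finisher wins
```

Spark applies the same idea per task attempt, relaunching slow tasks on other executors and keeping the first successful result.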

Spark Architecture

[Diagram: the Driver's RPC Environment hosts the BlockManagerMaster / BlockManagerMasterEndpoint; each Executor's RPC Environment hosts a BlockManager storing RDD blocks of RDD-0, e.g. RDDBlock(0, Partition-1) and RDDBlock(0, Partition-2)]

RDD Block Replication (sequence between Driver, Executor-1, Executor-2)
1. Driver asks Executor-1 to compute an RDD partition
2. Computation completes; block stored locally on Executor-1
3. Executor-1 asks the Driver for peers for replication; Driver returns the list of peers
4. Executor-1 replicates the block to a peer (Executor-2)
5. Results of the computation are returned to the Driver
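The sequence above can be modeled as a toy protocol: store locally, ask the driver for peers, push the block to one of them. This is a sketch of the interaction only; the class and method names are assumptions, not Spark's actual API.

```python
# Toy model of the block replication sequence: executor stores a block
# locally, fetches candidate peers from the driver, replicates to one peer.

class Driver:
    def __init__(self):
        self.executors = {}

    def register(self, executor):
        self.executors[executor.name] = executor

    def get_peers(self, requester):
        # Every executor except the requester is a candidate replica target.
        return [e for n, e in self.executors.items() if n != requester]

class Executor:
    def __init__(self, name, driver):
        self.name = name
        self.blocks = {}
        self.driver = driver
        driver.register(self)

    def compute_and_store(self, block_id, data):
        self.blocks[block_id] = data               # 2. block stored locally
        peers = self.driver.get_peers(self.name)   # 3. ask driver for peers
        if peers:
            peers[0].blocks[block_id] = data       # 4. replicate to a peer
        return data                                # 5. result returned

driver = Driver()
e1 = Executor("executor-1", driver)
e2 = Executor("executor-2", driver)
e1.compute_and_store("rdd_0_part_1", [1, 2, 3])
print("rdd_0_part_1" in e2.blocks)  # replica landed on the peer
```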

RDD Block Replication: Challenges
Lost RDD partitions are costly to recover; data is replenished only at query time
RDD replicated to random executors
On YARN, multiple executors can be brought up on the same node in different containers
Hence multiple replicas can land on the same node/rack, susceptible to node/rack failure
Lost block replicas are not recovered proactively

Topology Aware Replication (SPARK-15352)
Ideas & implementation by Shubham Chopra

Making peer selection for replication pluggable
Driver gets topology information for executors
Executors are informed about this topology information
Executors use prioritization logic to order peers for block replication
Pluggable TopologyMapper and BlockReplicationPrioritizer
Default implementation replicates current Spark behavior

Topology Aware Replication (SPARK-15352)
Customizable prioritization strategies to suit different deployments
Variety of replication objectives: ReplicateToDifferentHost, ReplicateBlockWithinRack, ReplicateBlockOutsideRack
Optimizer finds a minimum number of peers to meet the objectives
Replicates to these peers with a higher priority
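The prioritization logic can be sketched as a simple ordering over peers by failure domain: prefer a different rack, then a different host, and only as a last resort the same host (the YARN multiple-containers-per-node case above). This is an illustrative stand-in for Spark's pluggable TopologyMapper/BlockReplicationPrioritizer hooks, with assumed names.

```python
# Topology-aware peer prioritization in the spirit of SPARK-15352:
# order replica targets so replicas avoid sharing a failure domain.

def prioritize_peers(source, peers, topology):
    """Order peers: different rack first, then different host, then same
    host (least preferred). `topology` maps executor -> (host, rack)."""
    src_host, src_rack = topology[source]

    def badness(peer):
        host, rack = topology[peer]
        if rack != src_rack:
            return 0   # best: replica survives a rack failure
        if host != src_host:
            return 1   # ok: replica survives a host failure
        return 2       # worst: same host, e.g. two YARN containers on one node

    return sorted(peers, key=badness)

topology = {
    "e1": ("host-a", "rack-1"),
    "e2": ("host-a", "rack-1"),   # same host as the source
    "e3": ("host-b", "rack-1"),   # same rack, different host
    "e4": ("host-c", "rack-2"),   # different rack
}
print(prioritize_peers("e1", ["e2", "e3", "e4"], topology))
# ['e4', 'e3', 'e2']
```

Taking the first k peers from this ordering yields the "minimum number of peers meeting the objective" behavior the slide describes.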

Proactive replenishment of lost replicas
BlockManagerMasterEndpoint triggers replenishment when an executor failure is detected

Spark HA: Challenges
High availability of the Spark Driver
High bootstrap cost of reconstructing cluster and cached state
Naïve HA models (such as multiple active clusters) surface query inconsistency

High availability and low tail latency are closely related

Spark HA: A Strawman
Multiple Spark Servers in a leader-standby configuration
Each Spark Server backed by a different Spark cluster
Each Spark Server refreshed with up-to-date data
Queries to standbys are redirected to the leader; only the leader responds to queries (data consistency)
RDD partition loss in the leader is still a concern
Performance is still gated by the slowest executor in the leader
Resource usage is amplified by the number of Spark Servers
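The redirect-to-leader rule in this strawman is worth making concrete: standbys never answer, they forward to whichever server currently holds leadership, so clients always see one consistent view. A minimal sketch with assumed names (not the actual Spark Server code):

```python
# Toy leader-standby setup: only the leader answers queries; standbys
# redirect to the current leader for data consistency.

class SparkServer:
    def __init__(self, name):
        self.name = name
        self.is_leader = False

    def query(self, q, cluster):
        if self.is_leader:
            return f"{self.name} answered: {q}"
        # Standby: forward to whichever server currently holds leadership.
        return cluster.leader().query(q, cluster)

class Cluster:
    def __init__(self, servers):
        self.servers = servers
        servers[0].is_leader = True   # simplest possible leader election

    def leader(self):
        return next(s for s in self.servers if s.is_leader)

cluster = Cluster([SparkServer("server-1"), SparkServer("server-2")])
# Query sent to the standby is transparently redirected to the leader.
print(cluster.servers[1].query("price history", cluster))
```

The slide's criticisms follow directly from this shape: the standby clusters burn resources without serving traffic, and a lost partition or slow executor inside the leader still hurts every query.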

Spark Driver State
The Spark Driver is an arbitrary Java application
Only a subset of its state is interesting or expensive to reconstruct
For online use cases, only the RDDs/DFs created during ingestion are of interest
Expressing ingestion using DFs gives better decoupling of data/state than RDDs

Spark Driver State*
BlockManagerMasterEndpoint holds the block-to-executor assignment
Cache Manager holds logical plans and DataFrame references, used to short-circuit queries with pre-cached query plans where possible
Job Scheduler keeps track of the stages and tasks being scheduled
Executor information: hostnames and ports of live executors

*Illustrative, not exhaustive

Externalizing Driver State
Benefits:
Quicker recoveries
No need to restart executors
State accessible from multiple active-active drivers

Solutions:
Off-heap storage for RDDs
Residual book-keeping driver state externalized to ZooKeeper
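The externalization idea can be sketched end to end: the driver write-throughs its book-keeping (e.g. block-to-executor assignments) to an external store, so a replacement driver recovers that state on startup without restarting executors or recomputing anything. In this sketch a plain dict stands in for ZooKeeper, and all names are illustrative.

```python
# Sketch of externalized driver book-keeping state: a dict stands in for
# ZooKeeper; a replacement driver recovers state instead of rebuilding it.

external_store = {}   # stand-in for ZooKeeper znodes

class Driver:
    def __init__(self, store):
        self.store = store
        # On startup, recover any previously externalized assignments.
        self.block_assignments = dict(store.get("block_assignments", {}))

    def assign_block(self, block_id, executor):
        self.block_assignments[block_id] = executor
        # Write-through: every update is mirrored to the external store.
        self.store["block_assignments"] = dict(self.block_assignments)

d1 = Driver(external_store)
d1.assign_block("rdd_0_part_1", "executor-1")
del d1                       # simulate driver failure

d2 = Driver(external_store)  # replacement driver recovers state immediately
print(d2.block_assignments["rdd_0_part_1"])
```

With the cached blocks themselves held off-heap on the executors, this residual metadata is all a new (or concurrently active) driver needs to resume serving.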

Quorum of Drivers

THANK YOU.
skadambi@bloomberg.net