Apache Spark and Online Analytics
Transcript of Apache Spark and Online Analytics
Spark and Online Analytics
Sudarshan Kadambi
Copyright 2016 Bloomberg L.P. All rights reserved.
Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for Online use-cases
Data and Analytics is our Business
Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
  – Handle increasingly sophisticated client analytic workflows
  – Ad-hoc and cross-domain aggregations and filtering
• Heterogeneous data stores
  – Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries
Spark for Bloomberg Analytics
• Distributed compute scales well for
  – large security universes and
  – multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present a consistent interface for efficient data access
  – Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries
Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose challenges:
  – Redundancy in:
    » crafting and managing RDDs/DFs
    » coding the same or similar types of transforms/actions
  – Management of clusters, replication of data, etc.
  – Analytics are confined to specific content sets, making cross-asset analytics much harder
  – Real-time ingestion must be handled in each App
[Diagram: today, each Spark App runs against its own isolated Spark Cluster; with the Spark Server model, multiple Spark Apps share a single Spark Server on one Spark Cluster]
Bloomberg Spark Server
[Diagram: Spark Server internals — a shared Spark Context; a Request Handler dispatching to multiple Request Processors; an MDF Registry; a Function Transform Registry (FTR) holding functions such as RSI, …; and an Ingestion Manager feeding managed data frames MDF1 and MDF2]
Spark Server: Content Caching
• Data access has long-tail characteristics
• High-value data is subsetted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and the backing store
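The caching idea above can be sketched in plain Python. This is a minimal, illustrative model (the `ContentCache` class and its names are assumptions, not Bloomberg's API): a filter predicate supplied at registration time selects the high-value subset to hold in memory, and lookups that miss the cache fall through to the backing store transparently.

```python
# Hypothetical sketch of predicate-based content caching: a high-value
# subset is kept in memory, and misses fall through to the backing store.
class ContentCache:
    def __init__(self, backing_store, predicate):
        self.backing_store = backing_store  # full data set: key -> record
        self.predicate = predicate          # filter given at registration time
        # Cache only the records the predicate marks as high-value.
        self.cache = {k: v for k, v in backing_store.items() if predicate(v)}

    def get(self, key):
        # Hot path: serve from the in-memory subset when possible.
        if key in self.cache:
            return self.cache[key], "cache"
        # Long tail: fall back to the backing store seamlessly.
        return self.backing_store.get(key), "store"

store = {"AAPL": {"volume": 900}, "XYZ": {"volume": 3}}
cache = ContentCache(store, predicate=lambda rec: rec["volume"] > 100)
print(cache.get("AAPL"))  # served from the cache
print(cache.get("XYZ"))   # served from the backing store
```

From the caller's perspective the two paths are indistinguishable apart from latency, which is the "seamless unification" the slide describes.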
Spark HA: State of the World
– Execution lineage in the Driver
  • Recovery from lost RDDs
– RDD replication
  • Low latency, even with lost executors
  • Support for "MEMORY_ONLY", "MEMORY_ONLY_2", "MEMORY_ONLY_SER", "MEMORY_ONLY_SER_2" modes for in-memory persistence; easily extensible to more replicas if needed
– Speculative execution
  • Minimizes the performance hit from stragglers
– Off-heap data
  • Minimizes GC stalls
Spark Architecture
[Diagram: Spark architecture — the Driver's RPC environment hosts the BlockManagerMasterEndpoint and a BlockManagerMaster; each Executor's RPC environment hosts a BlockManagerMaster and a BlockManager; Executor-1's BlockManager holds RDD-0 blocks RDDBlock(0, Partition-1) and RDDBlock(0, Partition-2)]
RDD Block Replication
[Sequence diagram between Driver, Executor-1, and Executor-2:]
1. Driver asks Executor-1 to compute an RDD partition
2. Computation completes; the block is stored locally on Executor-1
3. Executor-1 asks the Driver for peers for replication
4. Driver returns the list of peers
5. Executor-1 replicates the block to a peer (Executor-2)
6. Results of the computation are returned to the Driver
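The handshake in the sequence above can be simulated in a few lines of plain Python. This is a toy model (the `Driver`/`Executor` classes here are illustrative stand-ins, not Spark's actual classes): the executor stores its block locally, asks the driver for peers, and pushes a replica to one peer before returning results.

```python
# Toy simulation of the block replication handshake (illustrative names).
class Driver:
    def __init__(self, executors):
        self.executors = executors

    def get_peers(self, requester):
        # The driver-side registry returns all executors except the requester.
        return [e for e in self.executors if e is not requester]

class Executor:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def compute_and_replicate(self, driver, block_id, data):
        self.blocks[block_id] = data          # block stored locally
        peers = driver.get_peers(self)        # "get peers for replication"
        if peers:
            peers[0].blocks[block_id] = data  # replicate block to a peer
        return data                           # results of the computation

e1, e2 = Executor("executor-1"), Executor("executor-2")
driver = Driver([e1, e2])
e1.compute_and_replicate(driver, ("rdd-0", "partition-1"), [1, 2, 3])
print(sorted(e.name for e in (e1, e2) if ("rdd-0", "partition-1") in e.blocks))
```

After the call, both executors hold a copy of the block, so a query can still be served at low latency if either one is lost.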
RDD Block Replication: Challenges
– Lost RDD partitions are costly to recover
  • Data replenished at query time
– RDD replicated to random executors
  • On YARN, multiple executors can be brought up on the same node in different containers
  • Hence multiple replicas are possible on the same node/rack, susceptible to node/rack failure
  • Lost block replicas are not recovered proactively
Topology Aware Replication (SPARK-15352)
– Ideas & implementation by Shubham Chopra
– Makes peer selection for replication pluggable
  • Driver gets topology information for executors
  • Executors are informed about this topology information
  • Executors use prioritization logic to order peers for block replication
  • Pluggable TopologyMapper and BlockReplicationPrioritizer
  • Default implementation replicates current Spark behavior
Topology Aware Replication (SPARK-15352)
– Customizable prioritization strategies to suit different deployments
  • Variety of replication objectives – ReplicateToDifferentHost, ReplicateBlockWithinRack, ReplicateBlockOutsideRack
  • Optimizer finds a minimum number of peers to meet the objectives
  • Replicates to these peers with a higher priority
– Proactive replenishment of lost replicas
  • BlockManagerMasterEndpoint triggers replenishment when an executor failure is detected
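A minimal sketch of the peer-selection idea, in pure Python (the function and data shapes are assumptions for illustration, not the actual `BlockReplicationPrioritizer` API): given a source executor and its peers' host/rack topology, pick a minimal set of peers that covers the different-host, within-rack, and outside-rack objectives.

```python
# Illustrative sketch of topology-aware peer selection: cover the
# ReplicateToDifferentHost / ReplicateBlockWithinRack /
# ReplicateBlockOutsideRack objectives with a minimal set of peers.
from collections import namedtuple

Peer = namedtuple("Peer", ["name", "host", "rack"])

def pick_peers(source, peers):
    # Greedy: one different-host peer outside the source's rack, plus one
    # different-host peer within the rack. Together these two replicas
    # satisfy all three objectives.
    off_rack = next((p for p in peers if p.rack != source.rack), None)
    in_rack = next((p for p in peers
                    if p.rack == source.rack and p.host != source.host), None)
    return [p for p in (off_rack, in_rack) if p is not None]

source = Peer("executor-1", "host-a", "rack-1")
peers = [Peer("executor-2", "host-a", "rack-1"),   # same host: skipped
         Peer("executor-3", "host-b", "rack-1"),   # within rack
         Peer("executor-4", "host-c", "rack-2")]   # outside rack
chosen = pick_peers(source, peers)
print([p.name for p in chosen])  # ['executor-4', 'executor-3']
```

Higher-priority peers (those in the chosen set) would be tried first; the real implementation makes both the topology mapping and the prioritization logic pluggable, so other deployments can substitute their own objectives.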
Spark HA: Challenges
– High availability of the Spark Driver
  • High bootstrap cost to reconstruct cluster and cached state
  • Naïve HA models (such as multiple active clusters) surface query inconsistency
– High availability and low tail latency are closely related
Spark HA – A Strawman
• Multiple Spark Servers in a Leader-Standby configuration
• Each Spark Server backed by a different Spark Cluster
• Each Spark Server refreshed with up-to-date data
• Queries to standbys redirected to the leader
• Only the leader responds to queries – data consistency
• RDD partition loss in the leader is still a concern
• Performance still gated by the slowest executor in the leader
• Resource usage amplified by the number of Spark Servers
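The strawman's consistency property comes from a simple rule: standbys never answer, they redirect. A toy model in Python (the `SparkServer` class here is purely illustrative) makes the trade-off concrete: answers are always consistent because exactly one server serves them, but every server must hold a full copy of the data.

```python
# Toy leader/standby model: standbys redirect queries to the leader,
# which guarantees consistency at the cost of N-fold resource usage.
class SparkServer:
    def __init__(self, name, data):
        self.name = name
        self.data = data  # each server keeps its own full, refreshed copy

    def query(self, key, leader):
        if self is not leader:
            # Standby: redirect to the leader rather than answering.
            return leader.query(key, leader)
        return self.data[key]

data = {"px_last": 101.5}
servers = [SparkServer(f"server-{i}", dict(data)) for i in range(3)]
leader = servers[0]
print(servers[2].query("px_last", leader))  # answered by the leader: 101.5
```

Note what the model does not fix: the leader's own executors can still lose RDD partitions or straggle, and the standbys triple the resource bill without serving any queries, which is exactly why this is only a strawman.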
Spark Driver State
• The Spark Driver is an arbitrary Java application
• Only a subset of its state is interesting or expensive to reconstruct
• For online use-cases, only RDDs/DFs created during ingestion are of interest
• Expressing ingestion using DFs gives better decoupling of data/state than RDDs
Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor assignments
• Cache Manager holds Logical Plan and DataFrame references
  – Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
  – Keeps track of the various stages and tasks being scheduled
• Executor information
  – Hostnames and ports of live executors
*Illustrative, not exhaustive
Externalizing Driver State
Benefits:
– Quicker recoveries
– No need to restart executors
– State accessible from multiple Active-Active drivers
Solutions:
– Off-heap storage for RDDs
– Residual book-keeping driver state externalized to ZooKeeper
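The externalization idea can be sketched as follows. In this illustrative model a plain dict stands in for the external store (ZooKeeper in the actual design), and the `register_block`/`locate` methods are assumed names: because the block-to-executor bookkeeping lives outside the driver process, a replacement driver attaches to the same store and recovers the assignments immediately, without restarting executors.

```python
# Sketch: residual driver bookkeeping (block -> executor assignments)
# kept in an external store, so a new driver recovers it on startup.
# A dict stands in for ZooKeeper; all names are illustrative.
external_store = {}

class Driver:
    def __init__(self, store):
        self.store = store  # shared, survives any single driver's failure

    def register_block(self, block_id, executor):
        self.store.setdefault(block_id, set()).add(executor)

    def locate(self, block_id):
        return self.store.get(block_id, set())

d1 = Driver(external_store)
d1.register_block("rdd-0-part-1", "executor-1")

# d1 fails; a replacement driver attaches to the same external store
# and immediately sees the existing block assignments.
d2 = Driver(external_store)
print(d2.locate("rdd-0-part-1"))  # {'executor-1'}
```

The same shared-store property is what allows multiple Active-Active drivers, and ultimately a quorum of drivers, to coordinate over the same state.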
Quorum of Drivers