Apache Spark and Online Analytics
Transcript of Apache Spark and Online Analytics
Spark and Online Analytics
Sudarshan Kadambi
Copyright 2016 Bloomberg L.P. All rights reserved.
Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for Online use-cases
Data and Analytics is our Business
Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
  – Handle increasingly sophisticated client analytic workflows
  – Ad-hoc and cross-domain aggregations and filtering
• Heterogeneous data stores
  – Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries
Spark for Bloomberg Analytics
• Distributed compute scales well for
  – large security universes and
  – multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present a consistent interface for efficient data access
  – Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries
Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose challenges:
  – Redundancy in:
    » crafting and managing RDDs/DFs
    » coding the same or similar types of transforms/actions
  – Management of clusters, replication of data, etc.
  – Analytics are confined to specific content sets, making cross-asset analytics much harder
  – Real-time ingestion must be handled in each App
[Diagram: today, each Spark App runs against its own isolated Spark Cluster; with the Spark Server model, multiple Spark Apps share a single Spark Server on one Spark Cluster]
Bloomberg Spark Server
[Diagram: Spark Server internals — a shared Spark Context; a Request Handler dispatching to multiple Request Processors; an MDF Registry; a Function Transform Registry (FTR) holding functions such as RSI, …; and an Ingestion Manager feeding managed data frames MDF1 and MDF2]
Spark Server: Content Caching
• Data access has long-tail characteristics
• High-value data is subsetted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and the backing store
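The caching idea above can be sketched in plain Python. This is a minimal, illustrative model (the `ContentCache` class and its names are assumptions, not Bloomberg's API): a filter predicate supplied at registration time selects the high-value subset to hold in memory, and lookups that miss the cache fall through to the backing store transparently.

```python
# Hypothetical sketch of predicate-based content caching: a high-value
# subset is kept in memory, and misses fall through to the backing store.
class ContentCache:
    def __init__(self, backing_store, predicate):
        self.backing_store = backing_store  # full data set: key -> record
        self.predicate = predicate          # filter given at registration time
        # Cache only the records the predicate marks as high-value.
        self.cache = {k: v for k, v in backing_store.items() if predicate(v)}

    def get(self, key):
        # Hot path: serve from the in-memory subset when possible.
        if key in self.cache:
            return self.cache[key], "cache"
        # Long tail: fall back to the backing store seamlessly.
        return self.backing_store.get(key), "store"

store = {"AAPL": {"volume": 900}, "XYZ": {"volume": 3}}
cache = ContentCache(store, predicate=lambda rec: rec["volume"] > 100)
print(cache.get("AAPL"))  # served from the cache
print(cache.get("XYZ"))   # served from the backing store
```

From the caller's perspective the two paths are indistinguishable apart from latency, which is the "seamless unification" the slide describes.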
Spark HA: State of the World
– Execution lineage in the Driver
  • Recovery from lost RDDs
– RDD replication
  • Low latency, even with lost executors
  • Support for "MEMORY_ONLY", "MEMORY_ONLY_2", "MEMORY_ONLY_SER", "MEMORY_ONLY_SER_2" modes for in-memory persistence; easily extensible to more replicas if needed
– Speculative execution
  • Minimizes the performance hit from stragglers
– Off-heap data
  • Minimizes GC stalls
Spark Architecture
[Diagram: Spark architecture — the Driver's RPC environment hosts the BlockManagerMasterEndpoint and a BlockManagerMaster; each Executor's RPC environment hosts a BlockManagerMaster and a BlockManager; Executor-1's BlockManager holds RDD-0 blocks RDDBlock(0, Partition-1) and RDDBlock(0, Partition-2)]
RDD Block Replication
[Sequence diagram between Driver, Executor-1, and Executor-2:]
1. Driver asks Executor-1 to compute an RDD partition
2. Computation completes; the block is stored locally on Executor-1
3. Executor-1 asks the Driver for peers for replication
4. Driver returns the list of peers
5. Executor-1 replicates the block to a peer (Executor-2)
6. Results of the computation are returned to the Driver
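The handshake in the sequence above can be simulated in a few lines of plain Python. This is a toy model (the `Driver`/`Executor` classes here are illustrative stand-ins, not Spark's actual classes): the executor stores its block locally, asks the driver for peers, and pushes a replica to one peer before returning results.

```python
# Toy simulation of the block replication handshake (illustrative names).
class Driver:
    def __init__(self, executors):
        self.executors = executors

    def get_peers(self, requester):
        # The driver-side registry returns all executors except the requester.
        return [e for e in self.executors if e is not requester]

class Executor:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def compute_and_replicate(self, driver, block_id, data):
        self.blocks[block_id] = data          # block stored locally
        peers = driver.get_peers(self)        # "get peers for replication"
        if peers:
            peers[0].blocks[block_id] = data  # replicate block to a peer
        return data                           # results of the computation

e1, e2 = Executor("executor-1"), Executor("executor-2")
driver = Driver([e1, e2])
e1.compute_and_replicate(driver, ("rdd-0", "partition-1"), [1, 2, 3])
print(sorted(e.name for e in (e1, e2) if ("rdd-0", "partition-1") in e.blocks))
```

After the call, both executors hold a copy of the block, so a query can still be served at low latency if either one is lost.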
RDD Block Replication: Challenges
– Lost RDD partitions are costly to recover
  • Data replenished at query time
– RDD replicated to random executors
  • On YARN, multiple executors can be brought up on the same node in different containers
  • Hence multiple replicas are possible on the same node/rack, susceptible to node/rack failure
  • Lost block replicas are not recovered proactively
Topology Aware Replication (SPARK-15352)
– Ideas & implementation by Shubham Chopra
– Makes peer selection for replication pluggable
  • Driver gets topology information for executors
  • Executors are informed about this topology information
  • Executors use prioritization logic to order peers for block replication
  • Pluggable TopologyMapper and BlockReplicationPrioritizer
  • Default implementation replicates current Spark behavior
Topology Aware Replication (SPARK-15352)
– Customizable prioritization strategies to suit different deployments
  • Variety of replication objectives – ReplicateToDifferentHost, ReplicateBlockWithinRack, ReplicateBlockOutsideRack
  • Optimizer finds a minimum number of peers to meet the objectives
  • Replicates to these peers with a higher priority
– Proactive replenishment of lost replicas
  • BlockManagerMasterEndpoint triggers replenishment when an executor failure is detected
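A minimal sketch of the peer-selection idea, in pure Python (the function and data shapes are assumptions for illustration, not the actual `BlockReplicationPrioritizer` API): given a source executor and its peers' host/rack topology, pick a minimal set of peers that covers the different-host, within-rack, and outside-rack objectives.

```python
# Illustrative sketch of topology-aware peer selection: cover the
# ReplicateToDifferentHost / ReplicateBlockWithinRack /
# ReplicateBlockOutsideRack objectives with a minimal set of peers.
from collections import namedtuple

Peer = namedtuple("Peer", ["name", "host", "rack"])

def pick_peers(source, peers):
    # Greedy: one different-host peer outside the source's rack, plus one
    # different-host peer within the rack. Together these two replicas
    # satisfy all three objectives.
    off_rack = next((p for p in peers if p.rack != source.rack), None)
    in_rack = next((p for p in peers
                    if p.rack == source.rack and p.host != source.host), None)
    return [p for p in (off_rack, in_rack) if p is not None]

source = Peer("executor-1", "host-a", "rack-1")
peers = [Peer("executor-2", "host-a", "rack-1"),   # same host: skipped
         Peer("executor-3", "host-b", "rack-1"),   # within rack
         Peer("executor-4", "host-c", "rack-2")]   # outside rack
chosen = pick_peers(source, peers)
print([p.name for p in chosen])  # ['executor-4', 'executor-3']
```

Higher-priority peers (those in the chosen set) would be tried first; the real implementation makes both the topology mapping and the prioritization logic pluggable, so other deployments can substitute their own objectives.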
Spark HA: Challenges
– High availability of the Spark Driver
  • High bootstrap cost to reconstruct cluster and cached state
  • Naïve HA models (such as multiple active clusters) surface query inconsistency
– High availability and low tail latency are closely related
Spark HA – A Strawman
• Multiple Spark Servers in a Leader-Standby configuration
• Each Spark Server backed by a different Spark Cluster
• Each Spark Server refreshed with up-to-date data
• Queries to standbys redirected to the leader
• Only the leader responds to queries – data consistency
• RDD partition loss in the leader is still a concern
• Performance still gated by the slowest executor in the leader
• Resource usage amplified by the number of Spark Servers
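The strawman's consistency property comes from a simple rule: standbys never answer, they redirect. A toy model in Python (the `SparkServer` class here is purely illustrative) makes the trade-off concrete: answers are always consistent because exactly one server serves them, but every server must hold a full copy of the data.

```python
# Toy leader/standby model: standbys redirect queries to the leader,
# which guarantees consistency at the cost of N-fold resource usage.
class SparkServer:
    def __init__(self, name, data):
        self.name = name
        self.data = data  # each server keeps its own full, refreshed copy

    def query(self, key, leader):
        if self is not leader:
            # Standby: redirect to the leader rather than answering.
            return leader.query(key, leader)
        return self.data[key]

data = {"px_last": 101.5}
servers = [SparkServer(f"server-{i}", dict(data)) for i in range(3)]
leader = servers[0]
print(servers[2].query("px_last", leader))  # answered by the leader: 101.5
```

Note what the model does not fix: the leader's own executors can still lose RDD partitions or straggle, and the standbys triple the resource bill without serving any queries, which is exactly why this is only a strawman.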
Spark Driver State
• The Spark Driver is an arbitrary Java application
• Only a subset of its state is interesting or expensive to reconstruct
• For online use-cases, only RDDs/DFs created during ingestion are of interest
• Expressing ingestion using DFs gives better decoupling of data/state than RDDs
Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor assignments
• Cache Manager holds Logical Plan and DataFrame references
  – Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
  – Keeps track of the various stages and tasks being scheduled
• Executor information
  – Hostnames and ports of live executors
*Illustrative, not exhaustive
Externalizing Driver State
Benefits:
– Quicker recoveries
– No need to restart executors
– State accessible from multiple Active-Active drivers
Solutions:
– Off-heap storage for RDDs
– Residual book-keeping driver state externalized to ZooKeeper
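The externalization idea can be sketched as follows. In this illustrative model a plain dict stands in for the external store (ZooKeeper in the actual design), and the `register_block`/`locate` methods are assumed names: because the block-to-executor bookkeeping lives outside the driver process, a replacement driver attaches to the same store and recovers the assignments immediately, without restarting executors.

```python
# Sketch: residual driver bookkeeping (block -> executor assignments)
# kept in an external store, so a new driver recovers it on startup.
# A dict stands in for ZooKeeper; all names are illustrative.
external_store = {}

class Driver:
    def __init__(self, store):
        self.store = store  # shared, survives any single driver's failure

    def register_block(self, block_id, executor):
        self.store.setdefault(block_id, set()).add(executor)

    def locate(self, block_id):
        return self.store.get(block_id, set())

d1 = Driver(external_store)
d1.register_block("rdd-0-part-1", "executor-1")

# d1 fails; a replacement driver attaches to the same external store
# and immediately sees the existing block assignments.
d2 = Driver(external_store)
print(d2.locate("rdd-0-part-1"))  # {'executor-1'}
```

The same shared-store property is what allows multiple Active-Active drivers, and ultimately a quorum of drivers, to coordinate over the same state.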
Quorum of Drivers