Cassandra Compute Cloud An Elastic Cassandra Infrastructure
Gurashish Singh Brar
Member of Technical Staff @ BloomReach
Abstract: Dynamically scaling Cassandra to serve hundreds of map-reduce jobs that arrive at an unpredictable rate, while at the same time providing real-time access to the data for front-end applications with strict TP95 latency guarantees, is a hard problem. We present a system for managing Cassandra clusters that provides the following functionality: 1) dynamic scaling of capacity to serve high-throughput map-reduce jobs; 2) real-time access to data generated by map-reduce jobs for front-end applications, with TP95 latency SLAs; 3) low cost, achieved by leveraging Amazon Spot Instances and demand-based scaling. At the heart of this infrastructure lies a custom data replication service that makes it possible to stream data to new nodes as needed.
What is it about?
• Dynamically scaling the infrastructure to support large EMR jobs
• Throughput SLA to backend applications
• TP95 latency SLA to frontend applications
• Cassandra 2.0 using vnodes
Agenda
• Application requirements
• Major issues we encountered
• Solutions to the issues
Application Requirements
• Backend EMR jobs performing scans, lookups and writes
  Heterogeneous applications with varying degrees of throughput SLAs
  Very high peak loads
  Always available (no maintenance periods or planned downtimes)
• Frontend applications performing lookups
  Data from backend applications expected in realtime
  Low latencies
• Developer support
How we started
[Diagram: Frontend Applications read from the Frontend DC; EMR Jobs read/write the Backend DC of a single Cassandra Cluster]
Frontend isolation using multiple DCs
[Diagram: one Cassandra Cluster split into a Frontend DC and a Backend DC]
Frontend Issue: Spillover Reads
[Diagram: reads on the Backend DC spill over into the Frontend DC]
Frontend Issue: Latencies vs Replication Load
[Diagram: EMR Jobs write to the Backend DC; replication into the Frontend DC competes with Frontend Application reads]
Backend Issue: Fixed Resource
[Diagram: two EMR Jobs sharing the fixed-capacity Backend DC]
Backend Issue: Fixed Resource
[Diagram: many concurrent EMR Jobs overwhelming the fixed-capacity Backend DC]
Backend Issue: Starvation
[Diagram: large EMR jobs with a relaxed SLA starve a small EMR job with a tighter SLA on the Backend DC]
Summary of Issues
• Frontend isolation is not perfect
• Frontend latencies are impacted by backend write load
• EMR jobs can overwhelm the Cassandra cluster
• Large EMR jobs can starve smaller ones
Rate Limiter
[Diagram: EMR Jobs obtain permits from a Token Server (Redis) before querying the Backend DC; Frontend Applications read from the Frontend DC]
Rate Limiter
• QPS allocated per operation and per application
• Operations can be scans, reads, writes, prepare, alter, create, etc.
• Each mapper/reducer obtains permits for 1 minute (configurable)
• The token bucket is periodically refreshed with the allocated capacity
• Quotas are dynamically adjusted to take advantage of other applications' unused quotas (we do want to maximize cluster usage)
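The bullets above can be sketched as a minimal token bucket. This is a hypothetical, in-memory stand-in: class and parameter names are illustrative, and in production the counters live in Redis so that an atomic fetch-and-subtract (e.g. DECRBY) works across all EMR nodes.

```python
# Illustrative per-(application, operation) token bucket; numbers are made up.
class TokenBucket:
    def __init__(self, allocated_qps, refresh_secs=60):
        self.allocated_qps = allocated_qps   # quota assigned to this (app, op)
        self.refresh_secs = refresh_secs     # length of a permit lease
        self.available = allocated_qps * refresh_secs

    def refresh(self):
        # Periodically restore the bucket to its full allocation.
        self.available = self.allocated_qps * self.refresh_secs

    def acquire(self, permits):
        # Atomic fetch-and-subtract in production (Redis DECRBY).
        if self.available >= permits:
            self.available -= permits
            return True
        return False                         # caller backs off and retries

# Each mapper/reducer asks for one minute of capacity up front.
bucket = TokenBucket(allocated_qps=500)      # e.g. 500 reads/sec for this app+op
granted = bucket.acquire(500 * 60)           # a full minute of permits
```

A mapper that fails to acquire permits simply waits for the next refresh, which is what keeps aggregate load on the cluster bounded.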
Why Redis?
• High load from all EMR nodes
• Low latency
• Support high number of concurrent connections
• Support atomic fetch and add
Cost of Rate Limiter
• We converted EMR from an elastic resource to a fixed resource
• To scale EMR we have to scale Cassandra
• Adding capacity to Cassandra cluster is not trivial
• Adding capacity under heavy load is harder
• Auto scaling and reducing under heavy load is even harder
Managing capacity - Requirements
• Time to increase capacity should be in minutes
• Programmatic management and not manual
• Minimum load on the production cluster during the operation
C* increasing capacity
[Diagram: adding nodes to the existing C* Cluster is expensive]
C* increasing capacity
[Diagram: solution — replicate to a new C* Cluster]
Custom Replication Service
[Diagram: the Custom Replication Service copies SSTable files from the Source Cluster to the Destination Cluster]
Custom Replication Service
• The replication service on each source node takes a snapshot of the column family
• SSTables in the snapshot are streamed evenly across the destination cluster
• The replication service on each destination node splits a single source SSTable into N SSTables
• Splits are computed using the SSTableReader & SSTableWriter classes; a single SSTable can be split in parallel by multiple threads
Custom Replication Service
• Once split, the new SSTables are streamed to the correct destination nodes
• A rolling restart is initiated on the destination cluster (we could have used nodetool refresh, but it was unreliable)
• The cluster is ready for use
• In parallel, compaction is triggered on the destination cluster to optimize reads
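The core of the split step is deciding which destination node owns each partition under the new ring. The sketch below is only an illustration of that bucketing: Cassandra's real implementation uses the Java SSTableReader/SSTableWriter classes and the Murmur3 partitioner, whereas here an md5-based token function stands in for the partitioner and the keys are hypothetical.

```python
# Bucketing a source SSTable's partitions into per-node groups for the
# destination ring. md5 is a stand-in for Cassandra's Murmur3Partitioner.
import bisect
import hashlib

def token(partition_key):
    # Map a partition key to a position on the token ring (stand-in hash).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def split_partitions(partitions, ring):
    """ring: sorted list of (token, node). The first node whose token is
    >= the key's token owns the key, wrapping around at the end."""
    tokens = [t for t, _ in ring]
    buckets = {node: [] for _, node in ring}
    for key in partitions:
        i = bisect.bisect_left(tokens, token(key)) % len(ring)
        buckets[ring[i][1]].append(key)
    return buckets

# Hypothetical 3-node destination ring and a handful of partitions.
ring = sorted((token(f"node{n}"), f"node{n}") for n in range(3))
buckets = split_partitions(["user:1", "user:2", "user:3", "user:4"], ring)
```

Because each source SSTable's partitions map independently onto the ring, multiple threads can split different SSTables (or ranges of one SSTable) in parallel, as the slide notes.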
Cluster Provisioning
• Estimate the required cluster size from the column family's disk size on the source cluster
• Provision machines on AWS (Cassandra is pre-installed on the AMI, so no setup is required)
• Generate the yaml and topology files for the new cluster and create a backend datacenter (application agnostic)
• Copy the schema from the source cluster to the destination cluster
• Call the replication service on the source cluster to replicate the data
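The first provisioning step, sizing the cluster, is simple arithmetic. The sketch below shows one plausible form of it; the per-node capacity and headroom figures are invented assumptions, not BloomReach's actual numbers.

```python
# Rough cluster-size estimate from column-family disk usage.
# per_node_capacity_bytes and headroom are illustrative assumptions.
import math

def estimate_nodes(cf_disk_bytes, replication_factor=3,
                   per_node_capacity_bytes=500 * 10**9, headroom=0.5):
    # Total bytes after replication, divided by the usable space per node
    # (capacity scaled down to leave room for compaction and growth).
    total = cf_disk_bytes * replication_factor
    usable = per_node_capacity_bytes * headroom
    return math.ceil(total / usable)

nodes = estimate_nodes(cf_disk_bytes=2 * 10**12)   # a 2 TB column family
```

With RF=3 and 250 GB usable per node, a 2 TB column family works out to 24 nodes; tuning the headroom factor trades cost against compaction pressure.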
C* Compute Cloud
[Diagram: a Cluster Management service provisions multiple on-demand clusters replicated from the Source Cluster; EMR Jobs run against the on-demand clusters]
C* Compute Cloud
• Very high throughput when moving raw data from the source to the destination cluster (10x increase in network usage compared to normal)
• Little CPU/memory load on the source cluster
• Leverages the size of the destination cluster to compute new SSTables for the new ring
• Time to provision varies between 10 and 40 minutes
• API driven, so it scales up and down automatically with demand
• Application agnostic
C* Compute Cloud - Limitations
• Snapshot model: take a snapshot of production and operate on it
  This works really well for some use cases, good for most, but not all
• Provisioning time is on the order of minutes
  This works for EMR jobs, which themselves take a few minutes to provision, but not for dedicated backend applications
• Writes still need to happen on the production reserved cluster
Where we are now
[Diagram: Frontend Applications read from the Frontend DC; EMR Jobs obtain permits from the Token Server (Redis); the Cluster Management service replicates data from the Backend DC to multiple on-demand clusters]
Exploiting the C* compute cloud
• Key feature: easy, automated and fast cluster provisioning with production data
• Use Spot Instances instead of On-Demand
• Failures of a few nodes are survivable due to C* redundancy
• In case of too many failures, just rebuild on retry (it's fast and automatic!)
Spot Instances
• The service supports all instance types in AWS and all AZs
• It picks the optimal Spot Instance type & AZ: the cheapest that satisfies the constraints
• This further reduces cost and improves the reliability of the service
• If the r3.2xlarge spot price spikes, on retry the service might pick c3.8xlarge
• Clusters auto-expire, so the fleet adjusts automatically to cheaper instances
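The selection rule above reduces to "cheapest eligible offer". A minimal sketch, with made-up prices and specs rather than live AWS data:

```python
# Pick the cheapest (instance type, AZ) spot offer meeting the constraints.
# All offers below are illustrative, not real spot-market prices.

def pick_spot(offers, min_cores, min_mem_gb):
    """offers: list of dicts with type, az, cores, mem_gb, price ($/hr)."""
    eligible = [o for o in offers
                if o["cores"] >= min_cores and o["mem_gb"] >= min_mem_gb]
    return min(eligible, key=lambda o: o["price"]) if eligible else None

offers = [
    {"type": "r3.2xlarge", "az": "us-east-1a", "cores": 8,  "mem_gb": 61, "price": 0.90},
    {"type": "r3.2xlarge", "az": "us-east-1b", "cores": 8,  "mem_gb": 61, "price": 0.11},
    {"type": "c3.8xlarge", "az": "us-east-1a", "cores": 32, "mem_gb": 60, "price": 0.45},
]
best = pick_spot(offers, min_cores=8, min_mem_gb=32)
```

Here the r3.2xlarge in us-east-1b wins; if its price spiked above the c3.8xlarge, a retry would switch instance types exactly as the slide describes.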
Cost or Capacity (take your pick)
Capacity of the C* compute cloud on spot instances
~= (5 to 10) x the capacity of a C* cluster using on-demand instances
for the same $ value
Issues Addressed
• Backend read capacity can scale linearly with the C* compute cloud
• Frontend latencies are protected from write load through rate limiting
Remaining issues
• Read load on backend DC can spillover to frontend DC causing spikes
• Write capacity is still defined by frontend latencies
Issue: Spillover Reads
[Diagram: reads on the Backend DC spill over into the Frontend DC]
Spillover Reads Fix: Fail the Read
[Diagram: backend reads that would spill over into the Frontend DC are failed instead]
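"Fail the read" amounts to restricting query routing to the local DC and erroring out instead of falling back to remote replicas. The host model below is hypothetical, purely to show the shape of the rule; with the real DataStax drivers the equivalent effect comes from a DC-aware load-balancing policy configured to use no remote hosts.

```python
# Fail fast instead of letting a backend read spill into the other DC.
# Host tuples and DC names here are invented for illustration.

class NoLocalReplicaError(Exception):
    """Raised when no live replica exists in the local datacenter."""

def hosts_for_read(hosts, local_dc):
    """hosts: list of (address, dc, is_up). Return only live local-DC
    hosts; raise rather than spilling the read over to another DC."""
    local = [addr for addr, dc, up in hosts if dc == local_dc and up]
    if not local:
        raise NoLocalReplicaError(f"no live replicas in {local_dc}")
    return local

hosts = [("10.0.0.1", "backend", True),
         ("10.0.0.2", "backend", False),
         ("10.0.1.1", "frontend", True)]
```

A backend client using this rule gets a fast, retryable error when the Backend DC is degraded, instead of silently adding load (and latency spikes) to the Frontend DC.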
Addressing the Write Capacity
• The obvious: only push updates that actually changed, not ones that are the same
  Big improvement: 80-90% of the data did not change
• Add more nodes: with the backend read load off production, it is a lot easier to expand capacity
• But we are still operating at ~1/3rd to 1/5th of the write capacity to keep read latencies low
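The "only push what changed" step can be done with a per-row content digest compared against the previous run. This is a hedged sketch of that idea, not the talk's actual implementation; the hashing scheme and row shapes are assumptions.

```python
# Skip writes for rows whose contents are unchanged since the last run.
# (Per the talk, 80-90% of rows typically did not change.)
import hashlib
import json

def row_hash(row):
    # Stable digest of the row's contents (key order normalized).
    return hashlib.sha1(json.dumps(row, sort_keys=True).encode()).hexdigest()

def changed_rows(rows, previous_hashes):
    """rows: {key: row_dict}; previous_hashes: {key: digest} from the
    last run. Returns only the rows that actually need to be written."""
    return {k: r for k, r in rows.items()
            if previous_hashes.get(k) != row_hash(r)}

prev = {"a": row_hash({"v": 1}), "b": row_hash({"v": 2})}
out = changed_rows({"a": {"v": 1}, "b": {"v": 3}, "c": {"v": 4}}, prev)
```

With 80-90% of rows unchanged, filtering like this cuts the write (and cross-DC replication) load by the same factor before a single mutation reaches Cassandra.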
Addressing the Write Capacity
• Experimental changes under evaluation:
• Prioritize reads over writes on the frontend: pause the write stage during a read
• Reduce replication load from the Backend DC to the Frontend DC: a column-level replication strategy, since most frontend applications operate on a subset view of the backend data
Key Takeaways
• Scale Cassandra dynamically for backend load by creating snapshot clusters
• Use rate limiter to protect the production cluster from spiky and unexpected backend traffic
• Build better isolation between frontend DC and backend DC
• Write throughput from backend to frontend remains a challenge
Questions?
Thank you