Cassandra Compute Cloud An Elastic Cassandra Infrastructure
Gurashish Singh Brar
Member of Technical Staff @ BloomReach
Abstract: Dynamically scaling Cassandra to serve hundreds of map-reduce jobs that arrive at an unpredictable rate, while at the same time providing real-time access to the data for front-end applications with strict TP95 latency guarantees, is a hard problem. We present a system for managing Cassandra clusters that provides the following functionality: 1) dynamic scaling of capacity to serve high-throughput map-reduce jobs; 2) real-time access to data generated by map-reduce jobs for front-end applications, with TP95 latency SLAs; 3) low cost, achieved by leveraging Amazon Spot Instances and demand-based scaling. At the heart of this infrastructure lies a custom data replication service that makes it possible to stream data to new nodes as needed.
What is it about?
• Dynamically scaling the infrastructure to support large EMR jobs
• Throughput SLA to backend applications
• TP95 latency SLA to frontend applications
• Cassandra 2.0 using vnodes
Agenda
• Application requirements
• Major issues we encountered
• Solutions to the issues
Application Requirements
• Backend EMR jobs performing scans, lookups and writes
  Heterogeneous applications with varying degrees of throughput SLAs
  Very high peak loads
  Always available (no maintenance periods or planned downtimes)
• Frontend applications performing lookups
  Data from backend applications expected in realtime
  Low latencies
• Developer support
How we started
[Diagram: Frontend Applications read from the Frontend DC; EMR Jobs read/write the Backend DC of a single Cassandra Cluster]
Frontend isolation using multiple DCs
[Diagram: one Cassandra Cluster split into a Frontend DC and a Backend DC]
Frontend Issue: Spillover Reads
[Diagram: reads on the Backend DC spill over into the Frontend DC]
Frontend Issue: Latencies vs Replication Load
[Diagram: EMR Jobs write to the Backend DC; replication into the Frontend DC competes with Frontend Application reads]
Backend Issue: Fixed Resource
[Diagram: two EMR Jobs sharing the fixed-capacity Backend DC]
Backend Issue: Fixed Resource
[Diagram: many concurrent EMR Jobs overwhelming the fixed-capacity Backend DC]
Backend Issue: Starvation
[Diagram: large EMR jobs with a relaxed SLA starve a small EMR job with a tighter SLA on the Backend DC]
Summary of Issues
• Frontend isolation is not perfect
• Frontend latencies are impacted by backend write load
• EMR jobs can overwhelm the Cassandra cluster
• Large EMR jobs can starve smaller ones
Rate Limiter
[Diagram: EMR Jobs obtain permits from a Token Server (Redis) before querying the Backend DC; Frontend Applications read from the Frontend DC]
Rate Limiter
• QPS allocated per operation and per application
• Operations can be scans, reads, writes, prepare, alter, create, etc.
• Each mapper/reducer obtains permits for 1 minute (configurable)
• The token bucket is periodically refreshed with the allocated capacity
• Quotas are dynamically adjusted to take advantage of other applications' unused quotas (we do want to maximize cluster usage)
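The bullets above can be sketched as a minimal token bucket. This is a hypothetical, in-memory stand-in: class and parameter names are illustrative, and in production the counters live in Redis so that an atomic fetch-and-subtract (e.g. DECRBY) works across all EMR nodes.

```python
# Illustrative per-(application, operation) token bucket; numbers are made up.
class TokenBucket:
    def __init__(self, allocated_qps, refresh_secs=60):
        self.allocated_qps = allocated_qps   # quota assigned to this (app, op)
        self.refresh_secs = refresh_secs     # length of a permit lease
        self.available = allocated_qps * refresh_secs

    def refresh(self):
        # Periodically restore the bucket to its full allocation.
        self.available = self.allocated_qps * self.refresh_secs

    def acquire(self, permits):
        # Atomic fetch-and-subtract in production (Redis DECRBY).
        if self.available >= permits:
            self.available -= permits
            return True
        return False                         # caller backs off and retries

# Each mapper/reducer asks for one minute of capacity up front.
bucket = TokenBucket(allocated_qps=500)      # e.g. 500 reads/sec for this app+op
granted = bucket.acquire(500 * 60)           # a full minute of permits
```

A mapper that fails to acquire permits simply waits for the next refresh, which is what keeps aggregate load on the cluster bounded.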
Why Redis?
• High load from all EMR nodes
• Low latency
• Support high number of concurrent connections
• Support atomic fetch and add
Cost of Rate Limiter
• We converted EMR from an elastic resource to a fixed resource
• To scale EMR we have to scale Cassandra
• Adding capacity to Cassandra cluster is not trivial
• Adding capacity under heavy load is harder
• Auto scaling and reducing under heavy load is even harder
Managing capacity - Requirements
• Time to increase capacity should be in minutes
• Programmatic management and not manual
• Minimum load on the production cluster during the operation
C* increasing capacity
[Diagram: adding nodes to the existing C* Cluster is expensive]
C* increasing capacity
[Diagram: solution — replicate to a new C* Cluster]
Custom Replication Service
[Diagram: the Custom Replication Service copies SSTable files from the Source Cluster to the Destination Cluster]
Custom Replication Service
• The replication service on each source node takes a snapshot of the column family
• SSTables in the snapshot are streamed evenly across the destination cluster
• The replication service on each destination node splits a single source SSTable into N SSTables
• Splits are computed using the SSTableReader & SSTableWriter classes; a single SSTable can be split in parallel by multiple threads
Custom Replication Service
• Once split, the new SSTables are streamed to the correct destination nodes
• A rolling restart is initiated on the destination cluster (we could have used nodetool refresh, but it was unreliable)
• The cluster is ready for use
• In parallel, compaction is triggered on the destination cluster to optimize reads
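The core of the split step is deciding which destination node owns each partition under the new ring. The sketch below is only an illustration of that bucketing: Cassandra's real implementation uses the Java SSTableReader/SSTableWriter classes and the Murmur3 partitioner, whereas here an md5-based token function stands in for the partitioner and the keys are hypothetical.

```python
# Bucketing a source SSTable's partitions into per-node groups for the
# destination ring. md5 is a stand-in for Cassandra's Murmur3Partitioner.
import bisect
import hashlib

def token(partition_key):
    # Map a partition key to a position on the token ring (stand-in hash).
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def split_partitions(partitions, ring):
    """ring: sorted list of (token, node). The first node whose token is
    >= the key's token owns the key, wrapping around at the end."""
    tokens = [t for t, _ in ring]
    buckets = {node: [] for _, node in ring}
    for key in partitions:
        i = bisect.bisect_left(tokens, token(key)) % len(ring)
        buckets[ring[i][1]].append(key)
    return buckets

# Hypothetical 3-node destination ring and a handful of partitions.
ring = sorted((token(f"node{n}"), f"node{n}") for n in range(3))
buckets = split_partitions(["user:1", "user:2", "user:3", "user:4"], ring)
```

Because each source SSTable's partitions map independently onto the ring, multiple threads can split different SSTables (or ranges of one SSTable) in parallel, as the slide notes.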
Cluster Provisioning
• Estimate the required cluster size from the column family's disk size on the source cluster
• Provision machines on AWS (Cassandra is pre-installed on the AMI, so no setup is required)
• Generate the yaml and topology files for the new cluster and create a backend datacenter (application agnostic)
• Copy the schema from the source cluster to the destination cluster
• Call the replication service on the source cluster to replicate the data
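The first provisioning step, sizing the cluster, is simple arithmetic. The sketch below shows one plausible form of it; the per-node capacity and headroom figures are invented assumptions, not BloomReach's actual numbers.

```python
# Rough cluster-size estimate from column-family disk usage.
# per_node_capacity_bytes and headroom are illustrative assumptions.
import math

def estimate_nodes(cf_disk_bytes, replication_factor=3,
                   per_node_capacity_bytes=500 * 10**9, headroom=0.5):
    # Total bytes after replication, divided by the usable space per node
    # (capacity scaled down to leave room for compaction and growth).
    total = cf_disk_bytes * replication_factor
    usable = per_node_capacity_bytes * headroom
    return math.ceil(total / usable)

nodes = estimate_nodes(cf_disk_bytes=2 * 10**12)   # a 2 TB column family
```

With RF=3 and 250 GB usable per node, a 2 TB column family works out to 24 nodes; tuning the headroom factor trades cost against compaction pressure.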
C* Compute Cloud
[Diagram: a Cluster Management service provisions multiple on-demand clusters replicated from the Source Cluster; EMR Jobs run against the on-demand clusters]
C* Compute Cloud
• Very high throughput when moving raw data from the source to the destination cluster (10x increase in network usage compared to normal)
• Little CPU/memory load on the source cluster
• Leverages the size of the destination cluster to compute new SSTables for the new ring
• Time to provision varies between 10 and 40 minutes
• API driven, so it scales up and down automatically with demand
• Application agnostic
C* Compute Cloud - Limitations
• Snapshot model: take a snapshot of production and operate on it
  This works really well for some use cases, good for most, but not all
• Provisioning time is on the order of minutes
  This works for EMR jobs, which themselves take a few minutes to provision, but not for dedicated backend applications
• Writes still need to happen on the production reserved cluster
Where we are now
[Diagram: Frontend Applications read from the Frontend DC; EMR Jobs obtain permits from the Token Server (Redis); the Cluster Management service replicates data from the Backend DC to multiple on-demand clusters]
Exploiting the C* compute cloud
• Key feature: easy, automated and fast cluster provisioning with production data
• Use Spot Instances instead of On-Demand
• Failures of a few nodes are survivable due to C* redundancy
• In case of too many failures, just rebuild on retry (it's fast and automatic!)
Spot Instances
• The service supports all instance types in AWS and all AZs
• It picks the optimal Spot Instance type & AZ: the cheapest that satisfies the constraints
• This further reduces cost and improves the reliability of the service
• If the r3.2xlarge spot price spikes, on retry the service might pick c3.8xlarge
• Clusters auto-expire, so the fleet adjusts automatically to cheaper instances
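The selection rule above reduces to "cheapest eligible offer". A minimal sketch, with made-up prices and specs rather than live AWS data:

```python
# Pick the cheapest (instance type, AZ) spot offer meeting the constraints.
# All offers below are illustrative, not real spot-market prices.

def pick_spot(offers, min_cores, min_mem_gb):
    """offers: list of dicts with type, az, cores, mem_gb, price ($/hr)."""
    eligible = [o for o in offers
                if o["cores"] >= min_cores and o["mem_gb"] >= min_mem_gb]
    return min(eligible, key=lambda o: o["price"]) if eligible else None

offers = [
    {"type": "r3.2xlarge", "az": "us-east-1a", "cores": 8,  "mem_gb": 61, "price": 0.90},
    {"type": "r3.2xlarge", "az": "us-east-1b", "cores": 8,  "mem_gb": 61, "price": 0.11},
    {"type": "c3.8xlarge", "az": "us-east-1a", "cores": 32, "mem_gb": 60, "price": 0.45},
]
best = pick_spot(offers, min_cores=8, min_mem_gb=32)
```

Here the r3.2xlarge in us-east-1b wins; if its price spiked above the c3.8xlarge, a retry would switch instance types exactly as the slide describes.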
Cost or Capacity (take your pick)
Capacity of the C* compute cloud on spot instances
~= (5 to 10) x the capacity of a C* cluster using on-demand instances
for the same $ value
Issues Addressed
• Backend read capacity can scale linearly with the C* compute cloud
• Frontend latencies are protected from write load through rate limiting
Remaining issues
• Read load on backend DC can spillover to frontend DC causing spikes
• Write capacity is still defined by frontend latencies
Issue: Spillover Reads
[Diagram: reads on the Backend DC spill over into the Frontend DC]
Spillover Reads Fix: Fail the Read
[Diagram: backend reads that would spill over into the Frontend DC are failed instead]
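"Fail the read" amounts to restricting query routing to the local DC and erroring out instead of falling back to remote replicas. The host model below is hypothetical, purely to show the shape of the rule; with the real DataStax drivers the equivalent effect comes from a DC-aware load-balancing policy configured to use no remote hosts.

```python
# Fail fast instead of letting a backend read spill into the other DC.
# Host tuples and DC names here are invented for illustration.

class NoLocalReplicaError(Exception):
    """Raised when no live replica exists in the local datacenter."""

def hosts_for_read(hosts, local_dc):
    """hosts: list of (address, dc, is_up). Return only live local-DC
    hosts; raise rather than spilling the read over to another DC."""
    local = [addr for addr, dc, up in hosts if dc == local_dc and up]
    if not local:
        raise NoLocalReplicaError(f"no live replicas in {local_dc}")
    return local

hosts = [("10.0.0.1", "backend", True),
         ("10.0.0.2", "backend", False),
         ("10.0.1.1", "frontend", True)]
```

A backend client using this rule gets a fast, retryable error when the Backend DC is degraded, instead of silently adding load (and latency spikes) to the Frontend DC.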
Addressing the Write Capacity
• The obvious: only push updates that actually changed, not ones that are the same
  Big improvement: 80-90% of the data did not change
• Add more nodes: with the backend read load off production, it is a lot easier to expand capacity
• But we are still operating at ~1/3rd to 1/5th of the write capacity to keep read latencies low
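The "only push what changed" step can be done with a per-row content digest compared against the previous run. This is a hedged sketch of that idea, not the talk's actual implementation; the hashing scheme and row shapes are assumptions.

```python
# Skip writes for rows whose contents are unchanged since the last run.
# (Per the talk, 80-90% of rows typically did not change.)
import hashlib
import json

def row_hash(row):
    # Stable digest of the row's contents (key order normalized).
    return hashlib.sha1(json.dumps(row, sort_keys=True).encode()).hexdigest()

def changed_rows(rows, previous_hashes):
    """rows: {key: row_dict}; previous_hashes: {key: digest} from the
    last run. Returns only the rows that actually need to be written."""
    return {k: r for k, r in rows.items()
            if previous_hashes.get(k) != row_hash(r)}

prev = {"a": row_hash({"v": 1}), "b": row_hash({"v": 2})}
out = changed_rows({"a": {"v": 1}, "b": {"v": 3}, "c": {"v": 4}}, prev)
```

With 80-90% of rows unchanged, filtering like this cuts the write (and cross-DC replication) load by the same factor before a single mutation reaches Cassandra.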
Addressing the Write Capacity
• Experimental changes under evaluation:
• Prioritize reads over writes on the frontend: pause the write stage during a read
• Reduce replication load from the Backend DC to the Frontend DC: a column-level replication strategy, since most frontend applications operate on a subset view of the backend data
Key Takeaways
• Scale Cassandra dynamically for backend load by creating snapshot clusters
• Use rate limiter to protect the production cluster from spiky and unexpected backend traffic
• Build better isolation between frontend DC and backend DC
• Write throughput from backend to frontend remains a challenge
Questions?
Thank you