re:Invent 2012 Optimizing Cassandra


AWS re:Invent 2012 presentation on Optimizing Cassandra usage at Netflix. Overview of Netflix Open Source projects. Gregg Ulrich, Ruslan Meshenberg.

Transcript of re:Invent 2012 Optimizing Cassandra

Optimizing Cassandra for AWS

Ruslan Meshenberg, Gregg Ulrich - Netflix

Agenda

Netflix

AWS Cassandra

Netflix Inc.

With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland, Sweden, Norway, Denmark and Finland, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV series.

Why Cloud?

[Chart: Netflix API growth in requests, in billions per day, from January 2010 through January 2011, against flat data center capacity. Series: "Netflix API – Growth in requests" and "Data Center Capacity".]

Netflix.com is now ~100% Cloud

• Some small back end data sources still in progress
• USA-specific logistics remains in the datacenter
• Working on SOX, PCI as scope starts to include AWS
• All international product is cloud based

What is Cassandra?

• Persistent data store
• NoSQL
• Distributed key/value store
• Tunable eventual consistency
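"Tunable eventual consistency" means the client chooses, per operation, how many replicas must acknowledge a read or write. A toy Python sketch (not Cassandra code) of the rule behind it: with replication factor RF, a read level R and write level W are guaranteed to overlap on at least one replica whenever R + W > RF, which is why QUORUM reads after QUORUM writes behave strongly consistent.

```python
# Toy sketch of Cassandra-style tunable consistency (not Cassandra code).
def quorum(rf):
    # A quorum is a strict majority of the replicas.
    return rf // 2 + 1

def is_strongly_consistent(rf, read_level, write_level):
    # Reads and writes must overlap on at least one replica.
    return read_level + write_level > rf

rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))  # True
print(is_strongly_consistent(rf, 1, 1))                    # False: eventual only
```

With RF=3, ONE/ONE is fastest but only eventually consistent; QUORUM/QUORUM (2/2) pays one extra replica per operation for read-your-writes behavior.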

Why did we choose Cassandra?

• Open sourced and written in Java
• Multi-region replication
• Data model supports a wide range of use cases
• Runs on commodity hardware
• Enhanced to understand AWS topology
• Durable

Durability

• No single point of failure or specialized instances
• Multiple copies of data across availability zones
• Bootstrapping and hints restore data quickly
• All writes appended to a commit log
• Asynchronous cross-regional replication

How we configure Cassandra in AWS

[Diagram: Cassandra rings in us-east-1, us-west-2, and eu-west-1, each cluster spread across three availability zones and backed up to S3, with asynchronous replication between regions.]

Durability (Quorum)

[Diagram: quorum durability illustrated at three scopes: one instance, an availability zone, and a replica set.]
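The replica-set picture above depends on spreading copies across availability zones, so losing one zone costs at most one replica of any row. A simplified Python sketch in the spirit of Cassandra's NetworkTopologyStrategy (the topology and node names below are hypothetical): walk the token ring clockwise from the key's position and prefer nodes in zones that do not yet hold a copy.

```python
# Simplified AZ-aware replica placement (in the spirit of Cassandra's
# NetworkTopologyStrategy; real placement handles racks, fallbacks, etc.).
def place_replicas(ring, key_token, rf=3):
    # ring: list of (token, node, az) tuples sorted by token
    n = len(ring)
    # First node whose token is >= the key's token; wrap to 0 past the end.
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= key_token), 0)
    replicas, used_azs = [], set()
    for i in range(n):
        token, node, az = ring[(start + i) % n]
        if az not in used_azs:      # prefer a zone without a copy yet
            replicas.append(node)
            used_azs.add(az)
        if len(replicas) == rf:
            break
    return replicas

# Hypothetical six-node ring across three zones.
ring = sorted([
    (10, "node-a1", "us-east-1a"), (20, "node-b1", "us-east-1b"),
    (30, "node-c1", "us-east-1c"), (40, "node-a2", "us-east-1a"),
    (50, "node-b2", "us-east-1b"), (60, "node-c2", "us-east-1c"),
])
print(place_replicas(ring, 15))  # one replica in each of the three zones
```

Because each replica lands in a distinct zone, a quorum (2 of 3) survives the loss of any single availability zone.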

How we configure Cassandra in AWS

• Mostly m2.4xlarge, but migrating to SSDs
• Ephemeral storage for better performance
• Multiple ASGs per cluster, each with one AZ
• Single tenanted clusters
• Overprovisioned clusters

Optimizations

• Cassandra enhancements
• Client libraries
• Operations
• Schema and data management

Cassandra enhancements

• Bug fixes
• New features
• Performance
• Security
• AWS environment

Making a better Java client

• Multi-region and zone aware
• Latency-aware load balancer
• Fluent API on top of Thrift
• Best-practice recipes
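A zone-aware, latency-aware client routes each request to a nearby, fast coordinator. A toy Python sketch of that selection policy (illustrative only, not the Java client's actual algorithm): prefer hosts in the client's own zone to avoid cross-AZ hops, then pick the one with the lowest recent average latency.

```python
# Toy zone-aware + latency-aware host selection (illustrative only).
from statistics import mean

def pick_host(hosts, client_zone):
    # hosts: dicts with "name", "zone", and recent "latencies_ms" samples.
    # Prefer same-zone hosts; fall back to all hosts if none are local.
    local = [h for h in hosts if h["zone"] == client_zone] or hosts
    # Among candidates, pick the lowest recent average latency.
    return min(local, key=lambda h: mean(h["latencies_ms"]))["name"]

hosts = [
    {"name": "cass-1", "zone": "us-east-1a", "latencies_ms": [4.0, 5.0]},
    {"name": "cass-2", "zone": "us-east-1a", "latencies_ms": [2.0, 2.5]},
    {"name": "cass-3", "zone": "us-east-1b", "latencies_ms": [1.0, 1.2]},
]
print(pick_host(hosts, "us-east-1a"))  # cass-2: local zone beats raw latency
```

Note the design choice: zone locality outranks raw latency, because cross-AZ traffic adds both latency and transfer cost; the latency samples only break ties within the zone.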

Filling the operational void

• Tomcat webapp for Cassandra administration
• AWS-style instance provisioning
• Full and incremental backups
• JMX metrics collection
• Consistent configuration across clusters
• REST API for most administrative operations
• Security Groups configuration

Managing your data and schema

• Missing UI for Cassandra client users
• View and edit schema
• Point queries and data updates
• High-level cluster status and metrics
• Manages multiple Cassandra clusters
• Integrated access control
• Schema auditing

High level cluster status

Data query tool

Schema management tool

Operations

• June 29th AWS partial outage
• Observations
• Monitoring
• Maintenance

From the Netflix tech blog:

“Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.”

June 29th AWS partial outage

• During outage
  - All Cassandra instances in us-east-1a were inaccessible
  - nodetool ring showed all nodes as DOWN
  - Monitored other AZs to ensure availability
  - Waited for AWS to resolve the issue

• Recovery – power restored to us-east-1a
  - Majority of instances rejoined the cluster without issue
  - Most of the remainder required a reboot to fix
  - The others needed to be replaced, one at a time

Observations: AWS

• Ephemeral drive performance is better than EBS
• Instances seldom die on their own
• Use as many availability zones as possible
• Understand how AWS launches instances
• I/O is constrained in most AWS instance types
  - Repairs are very I/O intensive
  - Large size-tiered compactions can impact latency

• SSDs are game changers


Observations: Cassandra

• A slow node is worse than a down node
• Cold cache increases load and kills latency
• Use whatever dials you can find in an emergency
  - Remove node from coordinator list
  - Compaction throttling
  - Min/max compaction thresholds
  - Enable/disable gossip

• Leveled compaction performance is very promising
• 1.1.x and 1.2.x should address some big issues
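Several of the emergency dials above are nodetool operations. A small hypothetical Python helper that just assembles the command lines, for illustration; `setcompactionthroughput`, `disablegossip`, and `enablegossip` are real nodetool subcommands, but exercise them against your own cluster with care.

```python
# Hypothetical helper that builds (but does not run) nodetool command
# lines for the emergency "dials" mentioned above. Port 7199 is the
# conventional Cassandra JMX port.
def nodetool_cmd(host, *args, port=7199):
    return ["nodetool", "-h", host, "-p", str(port)] + list(args)

# Throttle compaction to 8 MB/s to shed I/O load during an incident.
print(nodetool_cmd("10.0.0.5", "setcompactionthroughput", "8"))
# Take a misbehaving node out of gossip so coordinators stop routing to it.
print(nodetool_cmd("10.0.0.5", "disablegossip"))
```

To actually execute one of these, the list could be passed to `subprocess.run`; keeping command construction separate makes the dials easy to log and audit during an incident.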


Monitoring

• Actionable
  - Hardware and network issues
  - Cluster consistency
• Cumulative Cassandra trends
  - Throughput and latency
  - Key Cassandra metrics (queues, dropped ops, table reads)
• Informational
  - Schema changes
  - Log file errors/exceptions
  - Recent restarts


Maintenance

• Repair clusters regularly
• Run off-line major compactions to avoid latency
  - SSDs will make this unnecessary
• Always replace nodes when they fail
• Periodically replace all nodes in the cluster
• Upgrade to new versions
  - Binary (rpm) for major upgrades or emergencies
  - Rolling AMI push over time
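Node replacement and rolling upgrades share one invariant: proceed serially, so at most one replica of any row is offline at a time. A minimal Python sketch of that loop (the `replace_fn` and `ring_healthy_fn` hooks are hypothetical stand-ins for real provisioning and health checks):

```python
# Minimal sketch of a serial, health-gated rolling replacement.
# replace_fn and ring_healthy_fn are hypothetical hooks: in practice they
# would provision a new instance and check nodetool ring status.
def rolling_replace(nodes, replace_fn, ring_healthy_fn):
    for node in nodes:
        replace_fn(node)
        # Never move on until the ring has fully absorbed the replacement.
        assert ring_healthy_fn(), f"ring unhealthy after replacing {node}"

replaced = []
rolling_replace(["n1", "n2", "n3"], replaced.append, lambda: True)
print(replaced)  # ['n1', 'n2', 'n3']
```

The health gate is the important part: with replication factor 3 and quorum operations, the cluster tolerates exactly one unavailable replica per range, so the loop must block rather than race ahead.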


Scaling Cassandra

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Client writes/s by node count (replication factor = 3):

   48 nodes – 174,373 writes/s
   96 nodes – 366,828 writes/s
  144 nodes – 537,172 writes/s
  288 nodes – 1,099,837 writes/s

800K writes per second in production
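The point of the benchmark above is linear scaling: throughput per node stays roughly constant as the cluster grows. A quick check in Python (node counts taken from the linked techblog post):

```python
# Writes/s from the Netflix scalability benchmark, keyed by node count.
data = {48: 174_373, 96: 366_828, 144: 537_172, 288: 1_099_837}

# Per-node throughput stays in a narrow ~3,600-3,800 writes/s band,
# which is what "linear scaling" means here: 6x the nodes, ~6.3x the writes.
for nodes, writes in data.items():
    print(nodes, round(writes / nodes))
```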

Disk vs. SSD Benchmark

Same Throughput, Lower Latency, Half Cost

[Diagram: load test driver → REST service → application backed by EVcache (memcached) and Cassandra. Disk configuration: 36x m2.xlarge EVcache plus 48x m2.4xlarge Cassandra. SSD configuration: 15x hi1.4xlarge Cassandra.]

Netflix is “all in” with Cassandra

• 50 production clusters
• 15 multi-region clusters
• 4 regions max for a single cluster
• 101 total TB of data across all clusters
• 780 Cassandra nodes
• 72 nodes / 32 TB of data in the largest Cassandra cluster
• 250k reads/s / 800k writes/s max on a single cluster

Future optimizations

• Cassandra as a Service
• Fewer clusters, more data
• Autoscaling Cassandra
• Priam on PEDs
• Self-maintaining Cassandra clusters

All optimizations are open sourced

• Enhancements committed to the open source project
• Netflix@github
  - Astyanax
  - Priam
  - Cassandra Explorers (coming soon)
• Motivations
  - Give back to the Apache-licensed OSS community
  - Help define best practices

Netflix Open Source Center

Conclusion

• Cassandra is high performing and durable in AWS
• Cassandra is flexible enough to handle most use cases
• AWS offerings help provide a complete solution
• Cassandra performs well in AWS, especially on SSDs
• "Just because Netflix does it doesn't make it right for you"

Follow us

• http://techblog.netflix.com
• http://netflix.github.com
• Twitter
  - @Netflix
  - @NetflixJobs
  - @rusmeshenberg (Ruslan)
  - @eatupmartha (Gregg)

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.