Cassandra Operations at Netflix


Slides from the Netflix Cassandra Meetup on 3/27: lessons learned and tools created at Netflix to manage Cassandra clusters in AWS.

Transcript of Cassandra Operations at Netflix

Page 1: Cassandra Operations at Netflix


Gregg Ulrich

Page 2: Cassandra Operations at Netflix


Agenda

Who we are

How much we use Cassandra

How we do it

What we learned

Page 3: Cassandra Operations at Netflix


Who we are

Cloud Database Engineering

Development – Cassandra and related tools

Architecture – data modeling and sizing

Operations – availability, performance and maintenance

Operations

24x7 on-call support for all Cassandra clusters

Cassandra operations tools

Proactive problem hunting

Routine and non-routine maintenance

Page 4: Cassandra Operations at Netflix


How much we use Cassandra

30 – production clusters

12 – multi-region clusters

3 – max regions for a single cluster

65 – total TB of data across all clusters

472 – Cassandra nodes

72 / 28 – largest cluster (nodes / TB of data)

50k / 250k – max reads / writes per second on a single cluster

3* – size of the operations team

* Open position for an additional engineer

Page 5: Cassandra Operations at Netflix


I read that Netflix doesn’t have operations

Extension of Amazon’s PaaS

Decentralized Cassandra ops is expensive at scale

Immature product that changes rapidly (and drastically)

Easily apply best practices across all clusters

Page 6: Cassandra Operations at Netflix


How we configure Cassandra in AWS

Most services get their own Cassandra cluster

Mostly m2.4xlarge instances, but considering others

Cassandra and supporting tools baked into the AMI

Data stored on ephemeral drives

Data durability – all writes go to all availability zones

Alternate AZs in a replication set

RF = 3

Page 7: Cassandra Operations at Netflix


Minimum cluster configuration

Minimum production cluster configuration – 6 nodes:

3 auto-scaling groups

2 instances per auto-scaling group

1 availability zone per auto-scaling group
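As a sketch of how this layout might be expressed with boto3, the AWS SDK for Python (the group and launch-configuration names are hypothetical, and Netflix's actual provisioning runs through its own tooling):

```python
import boto3

# Hypothetical sketch of the 6-node minimum: three auto-scaling groups,
# one per availability zone, two instances each.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

for i, az in enumerate(["us-east-1a", "us-east-1b", "us-east-1c"], start=1):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass-demo-asg{i}",        # hypothetical name
        LaunchConfigurationName="cass-demo-launchcfg",   # AMI with Cassandra and Priam baked in
        MinSize=2,
        MaxSize=2,
        DesiredCapacity=2,        # 2 instances per auto-scaling group
        AvailabilityZones=[az],   # 1 availability zone per auto-scaling group
    )
```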

Page 8: Cassandra Operations at Netflix


Minimum cluster configuration, illustrated

(Diagram: three auto-scaling groups – ASG1 in AZ1, ASG2 in AZ2, ASG3 in AZ3 – each holding two Priam-managed Cassandra nodes, with RF = 3 spanning all three zones)
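To make the diagram concrete, here is a rough sketch (not Priam's actual code) of assigning evenly spaced tokens so that consecutive positions on the ring alternate availability zones; with RF = 3, the three replicas of any key then land in three different zones:

```python
# RandomPartitioner-era token space: 0 .. 2**127.
RING_SIZE = 2**127
AZS = ["AZ1", "AZ2", "AZ3"]

def assign_tokens(num_nodes):
    """Evenly space tokens around the ring, alternating AZs."""
    return [(i * (RING_SIZE // num_nodes), AZS[i % len(AZS)])
            for i in range(num_nodes)]

for token, az in assign_tokens(6):
    print(f"token {token} -> {az}")

# With RF = 3, a key's replicas are three consecutive nodes on the
# ring -- which, by construction, sit in three different AZs, so a
# write survives the loss of any one zone.
```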

Page 9: Cassandra Operations at Netflix


Tools we use

Administration: Priam, Jenkins

Monitoring and alerting: Cassandra Explorer, Dashboards, Epic

Page 10: Cassandra Operations at Netflix


Tools we use – Priam

Open-sourced Tomcat webapp running on each instance

Multi-region token management via SimpleDB

Node replacement and ring expansion

Backup and restore:

Full nightly snapshot backup to S3

Incremental backup of flushed SSTables to S3 every 30 seconds

Metrics collected via JMX

REST API to most nodetool functions
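Since Priam runs as a webapp on every instance, routine administration can be scripted over HTTP. A minimal sketch, assuming Python's requests library; the endpoint paths below are illustrative rather than Priam's exact API, so check the project's documentation for the real routes:

```python
import requests

NODE = "http://10.0.0.10:8080"  # hypothetical Cassandra/Priam instance

# Illustrative endpoint paths -- not necessarily Priam's exact API.
token = requests.get(f"{NODE}/Priam/REST/v1/cassconfig/get_token").text
print(f"node owns token {token}")

# Trigger a snapshot backup to S3, the same operation Priam runs nightly.
requests.get(f"{NODE}/Priam/REST/v1/backup/do_snapshot")
```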

Page 11: Cassandra Operations at Netflix

Tools we use – Cassandra Explorer

• Kiosk mode – no alerting

• High-level cluster status (Thrift, gossip)

• Warns on a small set of metrics

Page 12: Cassandra Operations at Netflix

Tools we use – Epic

• Netflix-wide monitoring and alerting tool based on RRD

• Priam proxies all JMX data to Epic

• Very useful for finding specific issues

Page 13: Cassandra Operations at Netflix

Tools we use – Dashboards

• Next-level cluster metrics: throughput, latency, gossip status, maintenance operations, trouble indicators

• Useful for finding anomalies

• Most investigations start here

Page 14: Cassandra Operations at Netflix

Tools we use – Jenkins

• Scheduling tool for additional monitors and maintenance tasks

• Push-button automation for recurring tasks

• Repairs, upgrades, and other tasks are performed only through Jenkins, to preserve a history of actions (see the sketch below)

• On-call dashboard displays current issues and required maintenance
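A sketch of the kind of job Jenkins might drive, assuming a hypothetical node list and a Priam endpoint that wraps nodetool repair -pr (repairing each node's primary range once walks the whole ring exactly once):

```python
import time
import requests

NODES = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]  # hypothetical ring members

# Rolling repair, one node at a time, so every run is recorded in
# Jenkins' job history rather than in an engineer's shell scrollback.
for node in NODES:
    print(f"repairing {node} ...")
    # Hypothetical endpoint wrapping 'nodetool repair -pr'.
    requests.get(f"http://{node}:8080/Priam/REST/v1/cassadmin/repair")
    time.sleep(60)  # let streams and compactions settle before moving on
```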

Page 15: Cassandra Operations at Netflix


Things we monitor

Cassandra:

Throughput

Latency

Compactions

Repairs

Pending threads

Dropped operations

Java heap

SSTable counts

Cassandra log files

System:

Disk space

Load average

I/O errors

Network errors
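Many of these checks boil down to polling nodetool output on each instance. A minimal sketch of one such monitor (the threshold is invented for illustration), flagging pending threads and dropped operations from nodetool tpstats:

```python
import subprocess

PENDING_THRESHOLD = 100  # invented threshold, for illustration only

out = subprocess.check_output(["nodetool", "tpstats"]).decode()
for line in out.splitlines():
    parts = line.split()
    # Thread-pool lines look like: "ReadStage  0  5  103254  0"
    if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
        pool, pending = parts[0], int(parts[2])
        if pending > PENDING_THRESHOLD:
            print(f"ALERT: {pool} has {pending} pending tasks")
    # Dropped-message lines look like: "MUTATION  12"
    elif len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > 0:
        print(f"ALERT: {parts[1]} dropped {parts[0]} messages")
```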

Page 16: Cassandra Operations at Netflix


Other things we monitor

Compaction predictions

Backup failures

Recent restarts

Schema changes

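"Compaction predictions" can be approximated from SSTable sizes on disk: size-tiered compaction fires once roughly four similarly sized SSTables accumulate, so counting near-equal-size data files gives early warning of a large compaction. A crude sketch (real bucketing is per column family and far more careful):

```python
import os
from collections import defaultdict

DATA_DIR = "/var/lib/cassandra/data"  # default Cassandra data directory

# Bucket -Data.db files by directory and decimal order of magnitude of
# their size; size-tiered compaction triggers at ~4 similarly sized
# SSTables, so a full bucket predicts an imminent compaction.
buckets = defaultdict(int)
for root, _, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith("-Data.db"):
            size = os.path.getsize(os.path.join(root, name))
            buckets[(root, len(str(size)))] += 1  # crude size bucket

for (path, magnitude), count in buckets.items():
    if count >= 4:
        print(f"{path}: {count} SSTables of similar size -- compaction likely soon")
```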

Page 17: Cassandra Operations at Netflix


What we learned

Having Cassandra developers in house is crucial

Repairs are incredibly expensive

Multi-tenanted clusters are challenging

A down node is better than a slow node

Better to compact on our terms and not Cassandra’s

Sizing and tuning is difficult and often done live

Smaller per-node data size is better

Page 18: Cassandra Operations at Netflix

Q&A (and recommended viewing)

The Best of Times – Taft and Bakersfield are real places

South Park – later-season episodes like "The F Word" and "Elementary School Musical"

Caillou – my kids love this show; I don't know why

Until the Light Takes Us – scary documentary on Norwegian black metal