Cassandra Operations at Netflix

Slides from Netflix Cassandra Meetup on 3/27. Lessons learned and tools created at Netflix to manage Cassandra clusters in AWS.

Transcript of Cassandra Operations at Netflix

  • 1. Cassandra Operations at Netflix (Gregg Ulrich)

  • 2. Agenda
    - Who we are
    - How much we use Cassandra
    - How we do it
    - What we learned

  • 3. Who we are: Cloud Database Engineering
    - Development: Cassandra and related tools; architecture (data modeling and sizing); operations (availability, performance and maintenance)
    - Operations: 24x7 on-call support for all Cassandra clusters; Cassandra operations tools; proactive problem hunting; routine and non-routine maintenance

  • 4. How much we use Cassandra
    - 30        production clusters
    - 12        multi-region clusters
    - 3         max regions for one cluster
    - 65        total TB of data across all clusters
    - 472       Cassandra nodes
    - 72/28     largest Cassandra cluster (nodes / data in TB)
    - 50k/250k  max reads/writes per second on a single cluster
    - 3*        size of Operations team (* open position for an additional engineer)

  • 5. "I read that Netflix doesn't have operations"
    - Extension of Amazon's PaaS
    - Decentralized Cassandra ops is expensive at scale
    - Immature product that changes rapidly (and drastically)
    - Easily apply best practices across all clusters

  • 6. How we configure Cassandra in AWS
    - Most services get their own Cassandra cluster
    - Mostly m2.4xlarge instances, but considering others
    - Cassandra and supporting tools baked into the AMI
    - Data stored on ephemeral drives
    - Data durability: all writes go to all availability zones
    - Alternate AZs in a replication set, RF = 3

  • 7. Minimum cluster configuration
    - 6 nodes
    - 3 auto-scaling groups
    - 2 instances per auto-scaling group
    - 1 availability zone per auto-scaling group

  • 8. Minimum cluster configuration, illustrated
    (Diagram: ASG1 in AZ1, ASG2 in AZ2, ASG3 in AZ3, RF=3, with Priam on each node)

  • 9. Tools we use
    - Administration: Priam, Jenkins
    - Monitoring and alerting: Cassandra Explorer, Dashboards, Epic
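The replication layout on slides 6-8 can be sketched in a few lines: when the ring order alternates across three availability zones, walking the ring clockwise to pick RF=3 replicas places exactly one replica of every token range in each zone. This is an illustrative sketch only; the names and the simplified ring walk are assumptions, not Netflix or Priam code:

```python
# Hypothetical sketch (not Netflix/Priam code): 6 nodes, 2 per AZ,
# with the ring order alternating across the three availability zones.
RF = 3
ring = ["AZ1", "AZ2", "AZ3", "AZ1", "AZ2", "AZ3"]

def replicas(primary_index, ring, rf):
    """Walk the ring clockwise from the primary node, taking rf nodes."""
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]

# Every token range gets exactly one replica in each AZ, so losing a
# whole zone still leaves two copies of all data.
for i in range(len(ring)):
    assert sorted(replicas(i, ring, RF)) == ["AZ1", "AZ2", "AZ3"]
```

This is why the minimum configuration on slide 7 pairs three auto-scaling groups with one AZ each: the alternating placement, not the node count alone, is what makes RF=3 zone-fault-tolerant.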
  • 10. Tools we use: Priam
    - Open-sourced Tomcat webapp running on each instance
    - Multi-region token management via SimpleDB
    - Node replacement and ring expansion
    - Backup and restore: full nightly snapshot backup to S3; incremental backup of flushed SSTables to S3 every 30 seconds
    - Metrics collected via JMX
    - REST API to most nodetool functions

  • 11. Tools we use: Cassandra Explorer
    - Kiosk mode, no alerting
    - High-level cluster status (thrift, gossip)
    - Warns on a small set of metrics

  • 12. Tools we use: Epic
    - Netflix-wide monitoring and alerting tool based on RRD
    - Priam proxies all JMX data to Epic
    - Very useful for finding specific issues

  • 13. Tools we use: Dashboards
    - Next-level cluster metrics: throughput, latency, gossip status, maintenance operations, trouble indicators
    - Useful for finding anomalies
    - Most investigations start here

  • 14. Tools we use: Jenkins
    - Scheduling tool for additional monitors and maintenance tasks
    - Push-button automation for recurring tasks
    - Repairs, upgrades, and other tasks are only performed through Jenkins to preserve a history of actions
    - On-call dashboard displays current issues and maintenance required

  • 15. Things we monitor
    - Cassandra: throughput, latency, compactions, repairs, pending threads, dropped operations, Java heap, SSTable counts, Cassandra log files
    - System: disk space, load average, I/O errors, network errors

  • 16. Other things we monitor
    - Compaction predictions
    - Backup failures
    - Recent restarts
    - Schema changes
    - Monitors

  • 17. What we learned
    - Having Cassandra developers in house is crucial
    - Repairs are incredibly expensive
    - Multi-tenanted clusters are challenging
    - A down node is better than a slow node
    - Better to compact on our terms and not Cassandra's
    - Sizing and tuning is difficult and often done live
    - Smaller per-node data size is better
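Slide 10's backup scheme (a nightly full snapshot plus incremental uploads of flushed SSTables every 30 seconds) implies a per-cluster, per-day key layout in S3 so that a restore can pick a snapshot and replay the incrementals after it. Priam itself is a Java webapp and this is not its actual key scheme; the bucket name, path format, and function below are assumptions for illustration only:

```python
# Hypothetical sketch: build a per-cluster, per-day S3 key for a
# backed-up SSTable. The layout is an assumption, not Priam's real one.
import os

BUCKET = "example-cassandra-backups"  # assumed bucket name

def s3_key(cluster, date, sstable_path, kind="incremental"):
    """Group backups by cluster and day; 'kind' separates the nightly
    'snapshot' files from the 30-second 'incremental' uploads."""
    return "/".join([cluster, date, kind, os.path.basename(sstable_path)])

key = s3_key("cass_prod", "20120327",
             "/var/lib/cassandra/data/ks/cf/ks-cf-hd-12-Data.db")
# -> "cass_prod/20120327/incremental/ks-cf-hd-12-Data.db"
```

With a layout like this, restoring a node is a prefix listing under `cluster/date/snapshot/` followed by the incrementals with later timestamps, which is roughly the shape of the restore path the slide describes.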
  • 18. Q&A (and recommended viewing)
    - The Best of Times: Taft and Bakersfield are real places
    - South Park: later-season episodes like "F-Word" and "Elementary School Musical"
    - Caillou: my kids love this show; I don't know why
    - Until the Light Takes Us: scary documentary on Norwegian black metal