Cassandra Summit 2014: Active-Active Cassandra Behind the Scenes


Presenter: Roopa Tangirala, Senior Cloud Data Architect at Netflix. High availability is an essential requirement for any online business. Architecting around failures, expecting infrastructure to fail, and remaining highly available even then, is the key to success. One such effort at Netflix was the Active-Active implementation, which provided region resiliency. This presentation gives a brief overview of the Active-Active implementation and how it leveraged Cassandra's architecture in the backend to achieve its goal. It covers the journey to Active-Active from Cassandra's perspective and the data validation done to prove the backend would work without impacting the customer experience. The various problems faced, such as long repair times and gc_grace settings, plus lessons learned and what the team would do differently next time, are also discussed.

Transcript of Cassandra Summit 2014: Active-Active Cassandra Behind the Scenes

ABOUT NETFLIX

NETFLIX

ACTIVE-ACTIVE

WHAT IS ACTIVE-ACTIVE?

Also called dual-active, the term describes a network of independent processing nodes, each with access to a replicated database. Traffic intended for a failed node is either passed on to an existing node or load-balanced across the remaining nodes.

WHY ACTIVE-ACTIVE?

ENTERPRISE IT SOLUTIONS

WEB-SCALE CLOUD SOLUTIONS

RAPID SCALING

HIGH AVAILABILITY

DOES AN INSTANCE FAIL?

•  It can, plan for it
•  Bad code / configuration pushes
•  Latent issues
•  Hardware failure
•  Test with Chaos Monkey

DOES A ZONE FAIL?

•  Rarely, but it has happened before
•  Routing issues
•  DC-specific issues
•  App-specific issues within a zone
•  Test with Chaos Gorilla

DOES A REGION FAIL?

•  Full region: unlikely, very rare
•  Individual services can fail region-wide
•  Most likely: region-wide configuration issues
•  Test with Chaos Kong

EVERYTHING FAILS… EVENTUALLY

•  Keep your services running by embracing isolation and redundancy

•  Construct a highly agile and highly available service from ephemeral and assumed-broken components

ISOLATION

•  Changes in one region should not affect others
•  A regional outage should not affect others
•  Network partitioning between regions should not affect functionality / operations

REDUNDANCY

•  Make more than one (of pretty much everything)
•  Specifically, distribute services across Availability Zones and regions

HISTORY: X-MAS EVE 2012

•  Netflix multi-hour outage
•  us-east-1 regional Elastic Load Balancing issue
•  “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”

ACTIVE-ACTIVE ARCHITECTURE

THE PROCESS

IDENTIFYING CLUSTERS FOR AA

SNITCH CHANGES

Before: EC2Snitch (uses private IPs)
After: EC2MultiRegionSnitch (uses public IPs)

priam.multiregion.enable = true

storage_port : using private IPs
ssl_storage_port : using public IPs
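
As a minimal sketch, assuming the usual default ports, the cassandra.yaml side of this change looks roughly like:

endpoint_snitch: Ec2MultiRegionSnitch   # was Ec2Snitch before the extension
storage_port: 7000                      # intra-region node-to-node traffic, private IPs
ssl_storage_port: 7001                  # encrypted cross-region traffic, public IPs

With Ec2MultiRegionSnitch, each node broadcasts its public IP to peers in other regions while still talking over private IPs within its own region.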

SPIN UP NODES IN NEW REGION

[Diagram: APP writing to the existing us-east-1 cluster while new nodes are spun up in us-west-2]

UPDATE KEYSPACE

update keyspace <keyspace> with placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {us-east : 3, us-west-2 : 3};

(us-east : 3 is the existing region and replication factor; us-west-2 : 3 is the new region and replication factor)
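
For reference, a present-day CQL equivalent would look roughly like this (a sketch; the datacenter names must match whatever the snitch reports for each region):

ALTER KEYSPACE <keyspace> WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west-2': 3};

Adding the new datacenter to the replication map only assigns ownership; no data actually moves until the rebuild step below.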

REBUILD NEW REGION

Run nodetool rebuild us-east-1 on all us-west-2 nodes
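
A minimal sketch of running that across the new region, assuming SSH access and a hypothetical host list file:

# us-west-2-nodes.txt is a hypothetical list of the new region's nodes;
# stagger the rebuilds to limit cross-region streaming bandwidth
for h in $(cat us-west-2-nodes.txt); do
  ssh "$h" nodetool rebuild us-east-1
done

Rebuild streams a full copy of the data from the source datacenter, which is why it is typically run node by node rather than all at once.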

RUN NODETOOL REPAIR
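
For example, a primary-range repair, run on each node in turn:

nodetool repair -pr

The -pr option repairs only the ranges a node is primarily responsible for, which avoids repairing every range once per replica when looping over the whole cluster.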

VALIDATION

BENCHMARKING GLOBAL CASSANDRA: WRITE-INTENSIVE TEST OF CROSS-REGION REPLICATION

CAPACITY: 16 x hi1.4xlarge SSD nodes per zone = 96 total; 192 TB of SSD across six locations; up and running Cassandra in 20 minutes

[Diagram: benchmark topology. Cassandra replicas in Zones A, B, and C of both US-West-2 (Oregon) and US-East-1 (Virginia); test loads driven in each region plus a validation load; interzone and interregional traffic shown.]

•  18 TB backup restored from S3 using Priam
•  1 million writes at CL.ONE (wait for one replica to ack)
•  1 million reads after 500 ms at CL.ONE with no data loss
•  Interregional traffic up to 9 Gbit/s at 83 ms

TEST FOR THUNDERING HERD

TEST FOR RETRIES

[Chart: traffic during failure and retry]

KEY METRICS USED

•  99th / 95th percentile read latency (client & C*)
•  Dropped metrics on C*
•  Exceptions on C*
•  Heap usage on C*
•  Threads pending on C*

CONFIGURATION FOR TEST

•  24-node C* cluster on SSDs
•  220 client instances
•  70+ JMeter instances

[Charts: C* IOPS (total read and write IOPS) and 95th / 99th percentile latencies]

CHECK FOR CEILING

NETWORK PARTITION

[Diagram: simulated partition between us-east-1 and us-west-2]

TAKEAWAYS

REPAIRS AFTER EXTENSION ARE PAINFUL!

TIME TO REPAIR DEPENDS ON

•  Number of regions
•  Number of replicas
•  Data size
•  Amount of entropy

ADJUST GC_GRACE AFTER EXTENSION

•  Column family setting
•  Defined in seconds
•  Default: 10 days
•  Tweak gc_grace to accommodate the time taken to repair (see the sketch below)
•  BEWARE of deleted columns reappearing
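
A minimal sketch of the adjustment in CQL (the keyspace and table names are placeholders; 1728000 seconds is 20 days, double the 10-day default):

ALTER TABLE <keyspace>.<table> WITH gc_grace_seconds = 1728000;

If a full repair cannot complete within gc_grace_seconds, tombstones may be garbage-collected before they reach every replica, and deleted columns can come back to life.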

RUNBOOK

PLAN FOR CAPACITY

CONSISTENCY LEVEL

•  Check the client for its consistency level setting
•  In a multi-regional cluster, QUORUM is not the same as LOCAL_QUORUM
•  Recommended consistency levels: LOCAL_ONE (CASSANDRA-6202) for reads and LOCAL_QUORUM for writes (see the sketch below)
•  For region resiliency, avoid ALL or QUORUM calls
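
As an illustration, the session-level setting in cqlsh (application clients expose the same levels through their driver configuration):

CONSISTENCY LOCAL_QUORUM;

LOCAL_QUORUM waits only for a quorum of replicas in the local datacenter, so an outage or partition in the remote region does not block local reads and writes.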

HOW DO WE KNOW IT WORKS? CREATE CHAOS!!!

Benchmark … time consuming, but worth it!