Apache Cassandra at PagerDuty 2014

10/20/14: Watching Your Cassandra Cluster Melt

Description

Video: https://www.youtube.com/watch?v=wbUmIacfswU
Speaker: Owen Kim, Software Engineer
Company: PagerDuty

PagerDuty had the misfortune of watching its abused, underprovisioned Cassandra cluster collapse. This talk will cover the lessons learned from that experience, including:

• Which of the many, many metrics we learned to watch

• What mistakes we made that led to this catastrophe

• How we have changed our usage to make our Cassandra cluster more stable

Owen Kim is a Software Engineer at PagerDuty and enjoys whiskey, riding his Honda Shadow 600 (named "Chie"), and discussing the finer points of narrative and expression in video games.

Transcript of Apache Cassandra at PagerDuty 2014

Page 1: Watching Your Cassandra Cluster Melt

Page 2: What is PagerDuty?

Page 3: Cassandra at PagerDuty

• Used to provide durable, consistent reads/writes in a critical pipeline of service applications

• Built with Scala, Cassandra, and ZooKeeper

• Receives ~25 requests per second

• Each request breaks into a handful of operations, which are then processed asynchronously

• Never lose an event. Never lose a message.

• This has HUGE implications around our design and architecture.

Page 4: Cassandra at PagerDuty

• Cassandra 1.2

• Thrift API

• Using Hector/Cassie/Astyanax

• Assigned tokens

• Putting off migrating to vnodes (see the cassandra.yaml sketch after this list)

• It is not big data

• Clusters ~10s of GB

• Data in the pipe is considered ephemeral
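A minimal sketch of what that token setup looks like against the deferred vnodes migration, using the two standard cassandra.yaml settings; the token value is illustrative, not an actual PagerDuty token:

# cassandra.yaml with a manually assigned token (one distinct token per node)
initial_token: 28356863910078205288614550619314017621

# the vnodes migration being put off would replace that with:
# num_tokens: 256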

Page 5: Cassandra at PagerDuty

[Diagram: three datacenters, DC-A, DC-B, and DC-C, with inter-DC latencies of ~5 ms and ~20 ms between regions]

• Five (or ten) nodes in three regions

• Quorum CL

• RF = 5
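The deck doesn't show the keyspace definition, but a topology like this is conventionally declared with NetworkTopologyStrategy. A sketch via cqlsh; the keyspace name is hypothetical, and the 2/2/1 per-DC split is an assumption that merely adds up to the stated RF = 5:

cqlsh <<'EOF'
CREATE KEYSPACE pipeline WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC-A': 2, 'DC-B': 2, 'DC-C': 1
};
EOF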

Page 6: Cassandra at PagerDuty

• Operations cross the WAN and take an inter-DC latency hit.

• Since we use it as our pipeline without much of a user-facing front, we’re not latency-sensitive but throughput-sensitive.

• We get consistent read/write operations (see the quorum arithmetic after this list).

• Events aren’t lost. Messages aren’t repeated.

• We get availability in the face of the loss of an entire DC region.
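The consistency and availability claims come down to quorum arithmetic:

# QUORUM with RF = 5 needs floor(5/2) + 1 = 3 replicas per operation.
# A quorum write touches 3 of 5 replicas and a quorum read touches 3 of 5;
# since 3 + 3 > 5, every quorum read overlaps the latest quorum write.
# Assuming no single DC holds more than 2 of the 5 replicas, losing an
# entire region still leaves at least 3, so quorum stays achievable.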

Page 7: What Happened?

• Everything fell apart and our critical pipeline began refusing new events and halted progress on existing ones.

• Caused degraded performance and a three-hour outage in PagerDuty

• Unprecedented flush of in-flight data

• Gory details on the impact can be found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/

Page 8: What Happened…

• It was just a semi-regular day…

• …no particular changes in traffic

• …no particular changes in volume

• We had an incident the day before

• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.

• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators (see the nodetool sketch after this list).

• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3 of the 5).

• The next day, we started a repair that had been put off…
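A sketch of the nodetool commands in play here; disablethrift/enablethrift are the toggles the slide refers to, and the other two are the stock ways to watch a repair and compaction pileup:

nodetool disablethrift    # stop serving Thrift clients on this node
nodetool enablethrift     # re-enable once the node has recovered

nodetool compactionstats  # pending and active compactions
nodetool netstats         # streaming activity from the running repair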

Page 9: What happened…

[Chart: 1-minute system load on the cluster nodes during the incident]

Page 10: What we did…

• Tried a few things to mitigate the damage

• Stopped less critical tenants.

• Disabled thrift interfaces

• Disabled nodes

• No discernible effect.

• Left with no choice, we blew away all data and restarted Cassandra fresh

• This took only 10 minutes once we committed to doing it.

sudo rm -r /var/lib/cassandra/commitlog/*

sudo rm -r /var/lib/cassandra/saved_caches/*

sudo rm -r /var/lib/cassandra/data/*

• Then everything was fine and dandy, like sour candy.
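For completeness, a sketch of the full reset around those three commands, assuming the packaged init script; this was only survivable because data in the pipe was treated as ephemeral:

sudo service cassandra stop
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
sudo service cassandra start
nodetool ring             # confirm nodes rejoin and the ring looks healthy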

Page 11: So, what happened…?

What went horribly wrong?

• Multi-tenancy in the Cassandra cluster.

• Operational ease isn’t worth the transparency.

• Underprovisioning

• AWS m1.larges

• 2 cores

• 7.5 GB RAM: definitely not enough (see the heap note after this list)

• Poor monitoring and high-water marks

• A twisted desire to get everything out of our little cluster
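Part of why 7.5 GB hurts: by default, cassandra-env.sh auto-sizes the JVM heap from system memory, which on an m1.large works out to under 2 GB. A sketch of pinning it explicitly instead; the values are illustrative, not a recommendation:

# cassandra-env.sh: override the auto-calculated heap
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="400M"   # young generation; commonly sized ~100 MB per core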

Page 12: Why we didn’t see it coming…

Or, how I like to make myself feel better.

• Everything was fine 99% of the time.

• Read/write latencies close to the inter-DC latencies.

• Despite load being relatively high sometimes.

• Cassandra seems to have two modes: fine and catastrophe

• We thought, “we don’t have much data, it should be able to handle this.”

• We thought we must have misconfigured something; surely we didn’t need to scale up…

Page 13: What we should have seen…

[Charts: heap usage showing constant memory pressure, annotated “This is bad” vs. “This is good”]

Page 14: What we should have seen…

• Consistent memtable flushing

• “Flushing CFS(…) to relieve memory pressure”

• Slower repair/compaction times

• Likely related to the memory pressure

• Widening disparity between median and p95 read/write latencies
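Both signals are cheap to watch for; a sketch, assuming the Debian package’s default log location:

# the emergency-flush log line quoted above
grep "to relieve memory pressure" /var/log/cassandra/system.log | tail -20

# coordinator read/write latency distributions; watch the median and the
# tail drift apart
nodetool proxyhistograms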

Page 15: What we changed…

The aftermath was rough…

• Immediately replaced all nodes with m2.2xlarges

• 4 cores

• 34 GB RAM

• No more multi-tenancy.

• Required nasty service migrations

• Began watching a lot of pending task metrics.

• Blocked flush writers

• Dropped messages
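All three of those show up in a single command; a sketch of what to read in its output:

nodetool tpstats
# - sustained nonzero Pending on MutationStage or FlushWriter means the
#   node is falling behind
# - "All time blocked" on FlushWriter is the blocked-flush-writer count
# - the final section lists dropped messages per type (MUTATION, READ, ...)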

Page 16: Lessons Learned

• Cassandra has a steep performance-degradation curve.

• Stay ahead of the scaling curve.

• Jump on any warning signs

• Practice scaling. Be able to do it on short notice (a sketch follows this list).

• Cassandra performance deteriorates with changes in the data set and with asynchronous, eventual consistency.

• Just because your latencies were one way doesn’t mean they’re supposed to be that way.

• Don’t build for multi-tenancy in your cluster.
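With assigned tokens rather than vnodes, practicing a scale-up looks roughly like this; a sketch of the standard pre-vnodes bootstrap steps, not a runbook:

# 1. on the new node, set initial_token in cassandra.yaml to split the
#    hottest range, then start Cassandra and let the node bootstrap
# 2. verify the ring rebalanced
nodetool ring
# 3. on the pre-existing nodes, drop data they no longer own
nodetool cleanup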

Page 17: Thank you

PS: We’re hiring Cassandra people (enthusiast to expert) for our Realtime or Persistence teams.

http://www.pagerduty.com/company/work-with-us/

http://bit.ly/1ym8j9g