Apache Cassandra at PagerDuty 2014

10/20/14: Watching Your Cassandra Cluster Melt

Description

Video: https://www.youtube.com/watch?v=wbUmIacfswU
Speaker: Owen Kim, Software Engineer
Company: PagerDuty

PagerDuty had the misfortune of watching its abused, underprovisioned Cassandra cluster collapse. This talk will cover the lessons learned from that experience, including:

• Which of the many, many metrics we learned to watch

• What mistakes we made that led to this catastrophe

• How we have changed our usage to make our Cassandra cluster more stable

Owen Kim is a Software Engineer at PagerDuty and enjoys whiskey, riding his Honda Shadow 600 (named "Chie"), and discussing the finer points of narrative and expression in video games.

Transcript of Apache Cassandra at PagerDuty 2014

Page 1: Watching Your Cassandra Cluster Melt

Page 2: What is PagerDuty?

Page 3: Cassandra at PagerDuty

• Used to provide durable, consistent reads/writes in a critical pipeline of service applications

• Built with Scala, Cassandra, and ZooKeeper

• Receives ~25 requests per second

• Each request breaks into a handful of operations, which are then processed asynchronously

• Never lose an event. Never lose a message.

• This has HUGE implications around our design and architecture.

Page 4: Cassandra at PagerDuty

• Cassandra 1.2

• Thrift API

• Using Hector/Cassie/Astyanax

• Assigned tokens

• Putting off migrating to vnodes (see the cassandra.yaml sketch after this list)

• It is not big data

• Clusters ~10s of GB

• Data in the pipe is considered ephemeral
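A minimal sketch of what that token setup looks like against the deferred vnodes migration, using the two standard cassandra.yaml settings; the token value is illustrative, not an actual PagerDuty token:

# cassandra.yaml with a manually assigned token (one distinct token per node)
initial_token: 28356863910078205288614550619314017621

# the vnodes migration being put off would replace that with:
# num_tokens: 256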

Page 5: Cassandra at PagerDuty

[Diagram: three datacenters, DC-A, DC-B, and DC-C, with inter-DC latencies of ~5 ms and ~20 ms between regions]

• Five (or ten) nodes in three regions

• Quorum CL

• RF = 5
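The deck doesn't show the keyspace definition, but a topology like this is conventionally declared with NetworkTopologyStrategy. A sketch via cqlsh; the keyspace name is hypothetical, and the 2/2/1 per-DC split is an assumption that merely adds up to the stated RF = 5:

cqlsh <<'EOF'
CREATE KEYSPACE pipeline WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC-A': 2, 'DC-B': 2, 'DC-C': 1
};
EOF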

Page 6: Cassandra at PagerDuty

• Operations cross the WAN and take an inter-DC latency hit.

• Since we use it as our pipeline without much of a user-facing front, we’re not latency-sensitive but throughput-sensitive.

• We get consistent read/write operations (see the quorum arithmetic after this list).

• Events aren’t lost. Messages aren’t repeated.

• We get availability in the face of the loss of an entire DC region.
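The consistency and availability claims come down to quorum arithmetic:

# QUORUM with RF = 5 needs floor(5/2) + 1 = 3 replicas per operation.
# A quorum write touches 3 of 5 replicas and a quorum read touches 3 of 5;
# since 3 + 3 > 5, every quorum read overlaps the latest quorum write.
# Assuming no single DC holds more than 2 of the 5 replicas, losing an
# entire region still leaves at least 3, so quorum stays achievable.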

Page 7: What Happened?

• Everything fell apart and our critical pipeline began refusing new events and halted progress on existing ones.

• Caused degraded performance and a three-hour outage in PagerDuty

• Unprecedented flush of in-flight data

• Gory details on the impact can be found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/

Page 8: What Happened…

• It was just a semi-regular day…

• …no particular changes in traffic

• …no particular changes in volume

• We had an incident the day before

• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.

• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators (see the nodetool sketch after this list).

• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3 of the 5).

• The next day, we started a repair that had been put off…
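A sketch of the nodetool commands in play here; disablethrift/enablethrift are the toggles the slide refers to, and the other two are the stock ways to watch a repair and compaction pileup:

nodetool disablethrift    # stop serving Thrift clients on this node
nodetool enablethrift     # re-enable once the node has recovered

nodetool compactionstats  # pending and active compactions
nodetool netstats         # streaming activity from the running repair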

Page 9: What happened…

[Chart: 1-minute system load on the cluster nodes during the incident]

Page 10: What we did…

• Tried a few things to mitigate the damage

• Stopped less critical tenants.

• Disabled thrift interfaces

• Disabled nodes

• No discernible effect.

• Left with no choice, we blew away all data and restarted Cassandra fresh

• This took only 10 minutes once we committed to doing it.

sudo rm -r /var/lib/cassandra/commitlog/*

sudo rm -r /var/lib/cassandra/saved_caches/*

sudo rm -r /var/lib/cassandra/data/*

• Then everything was fine and dandy, like sour candy.
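For completeness, a sketch of the full reset around those three commands, assuming the packaged init script; this was only survivable because data in the pipe was treated as ephemeral:

sudo service cassandra stop
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
sudo service cassandra start
nodetool ring             # confirm nodes rejoin and the ring looks healthy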

Page 11: So, what happened…?

What went horribly wrong?

• Multi-tenancy in the Cassandra cluster.

• Operational ease isn’t worth the transparency.

• Underprovisioning

• AWS m1.larges

• 2 cores

• 7.5 GB RAM: definitely not enough (see the heap note after this list)

• Poor monitoring and high-water marks

• A twisted desire to get everything out of our little cluster
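Part of why 7.5 GB hurts: by default, cassandra-env.sh auto-sizes the JVM heap from system memory, which on an m1.large works out to under 2 GB. A sketch of pinning it explicitly instead; the values are illustrative, not a recommendation:

# cassandra-env.sh: override the auto-calculated heap
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="400M"   # young generation; commonly sized ~100 MB per core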

Page 12: Why we didn’t see it coming…

Or, how I like to make myself feel better.

• Everything was fine 99% of the time.

• Read/write latencies close to the inter-DC latencies.

• Despite load being relatively high sometimes.

• Cassandra seems to have two modes: fine and catastrophe

• We thought, “we don’t have much data, it should be able to handle this.”

• We thought we must have misconfigured something; surely we didn’t need to scale up…

Page 13: What we should have seen…

[Charts: heap usage showing constant memory pressure, annotated “This is bad” vs. “This is good”]

Page 14: What we should have seen…

• Consistent memtable flushing

• “Flushing CFS(…) to relieve memory pressure”

• Slower repair/compaction times

• Likely related to the memory pressure

• Widening disparity between median and p95 read/write latencies
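Both signals are cheap to watch for; a sketch, assuming the Debian package’s default log location:

# the emergency-flush log line quoted above
grep "to relieve memory pressure" /var/log/cassandra/system.log | tail -20

# coordinator read/write latency distributions; watch the median and the
# tail drift apart
nodetool proxyhistograms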

Page 15: What we changed…

The aftermath was rough…

• Immediately replaced all nodes with m2.2xlarges

• 4 cores

• 34 GB RAM

• No more multi-tenancy.

• Required nasty service migrations

• Began watching a lot of pending task metrics.

• Blocked flush writers

• Dropped messages
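All three of those show up in a single command; a sketch of what to read in its output:

nodetool tpstats
# - sustained nonzero Pending on MutationStage or FlushWriter means the
#   node is falling behind
# - "All time blocked" on FlushWriter is the blocked-flush-writer count
# - the final section lists dropped messages per type (MUTATION, READ, ...)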

Page 16: Lessons Learned

• Cassandra has a steep performance-degradation curve.

• Stay ahead of the scaling curve.

• Jump on any warning signs

• Practice scaling. Be able to do it on short notice (a sketch follows this list).

• Cassandra performance deteriorates with changes in the data set and with asynchronous, eventual consistency.

• Just because your latencies were one way doesn’t mean they’re supposed to be that way.

• Don’t build for multi-tenancy in your cluster.
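With assigned tokens rather than vnodes, practicing a scale-up looks roughly like this; a sketch of the standard pre-vnodes bootstrap steps, not a runbook:

# 1. on the new node, set initial_token in cassandra.yaml to split the
#    hottest range, then start Cassandra and let the node bootstrap
# 2. verify the ring rebalanced
nodetool ring
# 3. on the pre-existing nodes, drop data they no longer own
nodetool cleanup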

Page 17: Thank you

PS: We’re hiring Cassandra people (enthusiast to expert) for our Realtime or Persistence teams.

http://www.pagerduty.com/company/work-with-us/

http://bit.ly/1ym8j9g