Lessons from Automatic Incident Resolution for a Million Databases

SRECon EU, July 2016 Greg Burek Lessons from Automatic Incident Resolution for a Million Databases

SRECon EU, July 2016

Greg Burek

Lessons from Automatic Incident Resolution for a Million Databases

The Twelve-Factor App

Department of Data

PostgreSQL, Redis, Kafka

~ Million Databases

Tens of thousands of AWS Instances

Some Databases

Hundreds of AWS Instances

“The goal is to build systems that can scale linearly with machines & sub-linearly with people” - Caitie McCaffrey

Tackling Alert Fatigue

Monitor and alert on your business

Monitor and alert on your business

Usually, don’t alert on machine-specific metrics
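
As a rough illustration of the difference, the sketch below pages on a fleet-wide, customer-facing measure (the fraction of databases that accept a connection) rather than on any single machine's metrics. The function names and the 99.9% threshold are assumptions made up for this example, not details from the talk.

```python
from typing import Iterable


def availability(probe_results: Iterable[bool]) -> float:
    """Fraction of databases that passed a customer-style connection probe."""
    results = list(probe_results)
    if not results:
        return 0.0  # no data is treated as an outage, not as healthy
    return sum(results) / len(results)


def should_page(probe_results: Iterable[bool], threshold: float = 0.999) -> bool:
    """Page only when the business-level measure drops below the threshold.

    The 0.999 default is an illustrative value, not a recommendation.
    """
    return availability(probe_results) < threshold


# One failing database out of 10,000 does not page a human ...
print(should_page([True] * 9_999 + [False]))        # False
# ... but fifty failing at once does.
print(should_page([True] * 9_950 + [False] * 50))   # True
```

The single failing database is still worth fixing; the point is that it can be handled by automation instead of waking someone up.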

Write runbooks and playbooks

Turn playbooks into code
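
One way to picture this step: each sentence of a written playbook becomes a small function, and the playbook becomes an ordered list of them. This is a minimal sketch under that assumption. `Step`, `Playbook`, and the "unresponsive database" example are hypothetical names, not the system described in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str                 # the sentence from the written playbook
    action: Callable[[dict], bool]   # returns True when the step succeeded


@dataclass
class Playbook:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, incident: dict) -> bool:
        """Run steps in order; stop at the first one that fails."""
        for step in self.steps:
            incident.setdefault("log", []).append(step.description)
            if not step.action(incident):
                return False
        return True


# Hypothetical playbook for an unresponsive database server.
def confirm_down(incident: dict) -> bool:
    return incident.get("probe_failed", False)


def fail_over(incident: dict) -> bool:
    incident["failed_over"] = True  # placeholder for the real failover call
    return True


unresponsive_db = Playbook(
    "unresponsive-database",
    [Step("Confirm the database is really down", confirm_down),
     Step("Fail over to the standby", fail_over)],
)
```

Keeping the human-readable description next to each action means the log of an automated run reads like the original runbook.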

“The goal is not to never get paged, the goal is to never get paged for the same thing twice” - Astrid Atkinson

Engineering for the long game

Verify monitoring before restarting the world
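
A sketch of what such a check might look like, assuming two things the slides do not spell out: that a large share of the fleet appearing down at once is more likely a monitoring fault than a real outage, and that an independent second check is available. `hosts_to_restart`, `second_opinion`, and the 20% limit are illustrative names and values.

```python
from typing import Callable, Dict, List


def hosts_to_restart(
    primary: Dict[str, bool],
    second_opinion: Callable[[str], bool],
    max_down_fraction: float = 0.2,
) -> List[str]:
    """Return hosts safe to restart, or [] when the monitoring looks suspect.

    primary maps host name -> True if the primary monitor saw it healthy.
    second_opinion re-checks one host from an independent vantage point
    and returns True if that host looks healthy from there.
    """
    down = [host for host, healthy in primary.items() if not healthy]
    # If a large share of the fleet looks down at once, the monitor itself
    # is the more likely culprit: do nothing and leave it to a human.
    if primary and len(down) / len(primary) > max_down_fraction:
        return []
    # Otherwise act only on hosts that the independent check also sees as down.
    return [host for host in down if not second_opinion(host)]
```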

Circuit breakers
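
For automation, a circuit breaker can be as simple as a cap on how many actions are allowed in a time window, after which the automation stops and waits for a person. The class below is a minimal sketch of that idea. Its name, parameters, and the rule that only a human resets it are assumptions for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Trips when automation acts more than max_actions times per window."""

    def __init__(self, max_actions: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so the breaker can be tested
        self.recent: Deque[float] = deque()
        self.tripped = False

    def allow(self) -> bool:
        """Record an attempted action; return False once the breaker is open."""
        if self.tripped:
            return False
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_actions:
            self.tripped = True       # stays open until a human calls reset()
            return False
        self.recent.append(now)
        return True

    def reset(self) -> None:
        """Close the breaker again, e.g. after a human has investigated."""
        self.tripped = False
        self.recent.clear()
```

A remediation loop would call `allow()` before each automated action and page a human the first time it returns `False`.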

Automation can’t handle the unknown

Wake someone up on exceptions and timeouts
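
The two lines above can be sketched as a wrapper that runs a remediation with a deadline and hands off to a person on any exception or timeout. `run_or_page` and `page_human` are hypothetical names, and the `pages` list stands in for a real paging integration.

```python
import concurrent.futures
from typing import Callable, List

pages: List[str] = []  # stand-in for a real paging integration


def page_human(message: str) -> None:
    pages.append(message)


def run_or_page(name: str, remediation: Callable[[], None],
                timeout_seconds: float) -> bool:
    """Run a remediation; page a human if it raises or exceeds its deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remediation)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except concurrent.futures.TimeoutError:
        page_human(f"{name}: timed out after {timeout_seconds}s")
    except Exception as exc:  # anything unexpected is a human's problem
        page_human(f"{name}: failed with {exc!r}")
    finally:
        # Do not block on a stuck remediation. Note that the worker thread
        # is not killed; it keeps running in the background.
        pool.shutdown(wait=False)
    return False
```

The automation never tries to interpret a failure it was not written for; it only reports it.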

Have a REPL/console

Aggregate and review trends
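
A minimal sketch of the aggregation half, assuming each automated or manual resolution is recorded as an event with a `kind` field. The event shape and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_incident_kinds(events: Iterable[dict], n: int = 3) -> List[Tuple[str, int]]:
    """Count incidents by kind so the most frequent ones can be reviewed."""
    return Counter(event["kind"] for event in events).most_common(n)


# Hypothetical event records.
events = [
    {"kind": "disk_full", "resolved_by": "automation"},
    {"kind": "unresponsive", "resolved_by": "automation"},
    {"kind": "disk_full", "resolved_by": "human"},
    {"kind": "disk_full", "resolved_by": "automation"},
]
print(top_incident_kinds(events))  # [('disk_full', 3), ('unresponsive', 1)]
```

An incident kind that automation quietly fixes many times a week is still a problem worth a root-cause fix, and a count like this is what surfaces it.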

Humans can break

Automation can be simplistic

Humans + Automation for a resilient and operable system

1. Monitor and alert on your business
2. Write playbooks
3. Make playbooks into automation
4. Checks and balances of automation
5. Circuit breakers
6. Alert on exceptions and timeouts
7. Admin console
8. Aggregate and review trends

@gregburek

State Machines
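
The slide gives only the heading, so the following is a guess at the general shape rather than the speaker's design: a remediation modelled as an explicit table of states and transitions, where anything the table does not cover escalates to a human. All state and event names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    HEALTHY = auto()
    SUSPECT = auto()
    REMEDIATING = auto()
    NEEDS_HUMAN = auto()


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.HEALTHY, "check_failed"): State.SUSPECT,
    (State.SUSPECT, "check_passed"): State.HEALTHY,
    (State.SUSPECT, "check_failed"): State.REMEDIATING,
    (State.REMEDIATING, "fixed"): State.HEALTHY,
    (State.REMEDIATING, "error"): State.NEEDS_HUMAN,
    (State.REMEDIATING, "timeout"): State.NEEDS_HUMAN,
}


def step(state: State, event: str) -> State:
    """Advance one transition; anything unrecognised escalates to a human."""
    return TRANSITIONS.get((state, event), State.NEEDS_HUMAN)
```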