Lessons from Automatic Incident Resolution for a Million Databases
![Page 1: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/1.jpg)
SRECon EU, July 2016
Greg Burek
Lessons from Automatic Incident Resolution for a Million Databases
![Page 2: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/2.jpg)
![Page 3: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/3.jpg)
The Twelve-Factor App
![Page 4: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/4.jpg)
![Page 5: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/5.jpg)
![Page 6: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/6.jpg)
Department of Data
![Page 7: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/7.jpg)
PostgreSQL, Redis, Kafka
![Page 8: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/8.jpg)
~ Million Databases
Tens of thousands of AWS Instances
![Page 9: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/9.jpg)
Some Databases
Hundreds of AWS Instances
![Page 10: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/10.jpg)
![Page 11: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/11.jpg)
“The goal is to build systems that can scale linearly with machines & sub-linearly with people” - Caitie McCaffrey, Tackling Alert Fatigue
![Page 12: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/12.jpg)
Monitor and alert on your business
![Page 13: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/13.jpg)
![Page 14: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/14.jpg)
Monitor and alert on your business
Usually, don’t alert on machine-specific metrics
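As a minimal sketch of this idea, the check below pages on a customer-facing failure ratio and deliberately ignores a machine-level signal. The `FleetSnapshot` fields, the `should_page` function, and the 1% threshold are all hypothetical names and values chosen for illustration; the slides do not specify which business metrics were used.

```python
from dataclasses import dataclass


@dataclass
class FleetSnapshot:
    """Hypothetical aggregate view of the fleet over one time window."""
    connection_attempts: int   # customer connection attempts in the window
    connection_failures: int   # attempts that failed
    hosts_with_high_cpu: int   # machine-level signal: recorded, never paged on


def should_page(snapshot: FleetSnapshot, max_failure_ratio: float = 0.01) -> bool:
    """Page only when the customer-facing failure ratio exceeds the threshold."""
    if snapshot.connection_attempts == 0:
        # Assumption for this sketch: no traffic means nothing to judge.
        return False
    ratio = snapshot.connection_failures / snapshot.connection_attempts
    return ratio > max_failure_ratio
```

Many hot hosts with a healthy failure ratio produce no page, while a small number of failing customer connections above the threshold does.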
![Page 15: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/15.jpg)
Write runbooks and playbooks
![Page 16: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/16.jpg)
Turn playbooks into code
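One way to read this slide is that each written playbook step becomes a small function, and the playbook itself becomes an ordered list that stops at the first step that cannot proceed. The sketch below assumes a made-up "unavailable database" playbook; the step names and the dictionary-based `context` are illustrative, not taken from the talk.

```python
from typing import Callable, List, Tuple

# A step is a name plus a function that returns True when the step succeeded.
Step = Tuple[str, Callable[[dict], bool]]


def confirm_unreachable(ctx: dict) -> bool:
    # Only continue if the primary really is unreachable.
    return not ctx["primary_reachable"]


def promote_follower(ctx: dict) -> bool:
    # Fails when there is no follower to promote.
    if not ctx.get("follower"):
        return False
    ctx["primary"] = ctx["follower"]
    return True


def notify_owner(ctx: dict) -> bool:
    ctx["notified"] = True
    return True


# Hypothetical playbook: the order mirrors what a written runbook would list.
UNAVAILABLE_DB_PLAYBOOK: List[Step] = [
    ("confirm_unreachable", confirm_unreachable),
    ("promote_follower", promote_follower),
    ("notify_owner", notify_owner),
]


def run_playbook(steps: List[Step], ctx: dict) -> List[str]:
    """Run steps in order and stop at the first one that reports failure."""
    log = []
    for name, step in steps:
        ok = step(ctx)
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log
```

The returned log gives an operator the same trail they would have produced by hand while following the written playbook.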
![Page 17: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/17.jpg)
![Page 18: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/18.jpg)
“The goal is not to never get paged, the goal is to never get paged for the same thing twice” - Astrid Atkinson, Engineering for the long game
![Page 19: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/19.jpg)
Verify monitoring before restarting the world
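A possible sketch of this check: act only on hosts that several independent observers agree are down, and take no action at all when the confirmed failures cover an implausibly large share of the fleet, since that more likely points at broken monitoring. The function name, the observer model, and both thresholds are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Set


def hosts_safe_to_restart(
    reports: Dict[str, Set[str]],   # observer name -> hosts it sees as down
    fleet_size: int,
    min_observers: int = 2,         # hypothetical quorum
    max_fraction: float = 0.05,     # hypothetical sanity limit
) -> Set[str]:
    """Return hosts confirmed down by a quorum, or nothing if the result is implausible."""
    votes = Counter(host for hosts in reports.values() for host in hosts)
    confirmed = {host for host, n in votes.items() if n >= min_observers}
    # If "too much" of the fleet looks dead, distrust the monitoring instead.
    if fleet_size == 0 or len(confirmed) / fleet_size > max_fraction:
        return set()
    return confirmed
```

An empty result in the implausible case is a natural point to hand the decision to a human rather than restart everything.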
![Page 20: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/20.jpg)
![Page 21: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/21.jpg)
Circuit breakers
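The slide does not say how the circuit breakers were built. One common reading, sketched here under that assumption, is a limit on how many automated actions may run in a sliding time window; once the limit is hit, automation stops and a human decides. The class name and parameters are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class ActionCircuitBreaker:
    """Allow at most max_actions automated actions per sliding window."""

    def __init__(
        self,
        max_actions: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps: Deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit an action, or refuse if the window is full."""
        now = self._clock()
        # Forget actions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False  # breaker is open: stop acting automatically
        self._timestamps.append(now)
        return True
```

A caller would check `allow()` before each automated failover or restart and escalate to an operator when it returns `False`.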
![Page 22: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/22.jpg)
Automation can’t handle the unknown
Wake someone up on exceptions and timeouts
![Page 23: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/23.jpg)
![Page 24: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/24.jpg)
Have a REPL/console
![Page 25: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/25.jpg)
Aggregate and review trends
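A small sketch of what aggregation could look like: count how often each automated playbook ran so the most frequent ones can be reviewed for an underlying cause. The event shape (a dict with a `playbook` key) is an assumption for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_frequent_playbooks(events: Iterable[dict], top: int = 3) -> List[Tuple[str, int]]:
    """Return the playbooks that ran most often, with their run counts."""
    counts = Counter(event["playbook"] for event in events)
    return counts.most_common(top)
```

A playbook that quietly runs hundreds of times a week never pages anyone, so a periodic review of these counts is one way to notice it.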
![Page 26: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/26.jpg)
![Page 27: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/27.jpg)
Humans can break
Automation can be simplistic
Humans + Automation for a resilient and operable system
![Page 28: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/28.jpg)
1. Monitor and alert on your business
2. Write playbooks
3. Make playbooks into automation
4. Checks and balances of automation
5. Circuit breakers
6. Alert on exceptions and timeouts
7. Admin console
8. Aggregate and review trends
![Page 29: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/29.jpg)
[email protected]
@gregburek
![Page 30: Lessons from Automatic Incident Resolution for a Million Databases](https://reader033.fdocuments.net/reader033/viewer/2022042907/587125e51a28abe4448b61bd/html5/thumbnails/30.jpg)
State Machines
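This slide gives only a title, so the following is a generic sketch of how a state machine might model automated remediation, not a description of the system in the talk. The states, events, and transition table are all hypothetical.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    REMEDIATING = "remediating"
    NEEDS_HUMAN = "needs_human"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.HEALTHY, "check_failed"): State.SUSPECT,
    (State.SUSPECT, "check_passed"): State.HEALTHY,
    (State.SUSPECT, "failure_confirmed"): State.REMEDIATING,
    (State.REMEDIATING, "remediation_succeeded"): State.HEALTHY,
    (State.REMEDIATING, "remediation_failed"): State.NEEDS_HUMAN,
    (State.NEEDS_HUMAN, "operator_resolved"): State.HEALTHY,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, rejecting events that are invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None
```

Listing every legal transition explicitly means an unexpected event surfaces as an error instead of leaving a database in an undefined condition.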