S.R.E - create ultra-scalable and highly reliable systems
![Page 1: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/1.jpg)
![Page 2: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/2.jpg)
S.R.E. create ultra-scalable and highly reliable systems
Ricardo Amaro
DevOps - https://events.drupal.org/node/13519
![Page 3: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/3.jpg)
Who am I?
@Drupal
@ricardoamaro
Portugal
Lisbon
Drupal Community
Family
+8 years Drupal
90’s Linux Adopter
5 years at Acquia
Site Reliability Engineer, Senior Tier2 Ops
https://drupal.org/user/666176
![Page 4: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/4.jpg)
About Acquia Metrics
○ Acquia Cloud:
○ # of Instances (17,200+)
○ # of Production Sites (54,000+)
○ # of API Calls (3,000+ per sec)
○ # of Availability Zones (20+)
○ # of Regions (8)
![Page 5: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/5.jpg)
We will talk about
A brief summary inspired by Google’s S.R.E. book
○ What is S.R.E.?
○ Tenets of S.R.E.
○ Reliability & Toil
○ Error budget - keeping the Service Level Objective (SLO)
○ Development & Operations
○ Monitoring and Being On-Call
○ Release Engineering
○ Postmortem culture - Learning from failure
![Page 6: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/6.jpg)
What is S.R.E.?
![Page 7: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/7.jpg)
➔ Term coined by Google in 2003
➔ Ben Treynor was hired to run “production” and ended up “applying software engineering to an operations function”
➔ Motivation: “as a software engineer, how would I want to invest my time to accomplish a set of repetitive tasks?”
Site Reliability Engineering
![Page 8: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/8.jpg)
➔ SRE is taken seriously by major companies
Site Reliability Engineering
Microsoft
Apple
Amazon
![Page 9: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/9.jpg)
SREs are engineers who...
➔ Apply the principles of computer science and engineering to design and develop large, distributed computing systems.
➔ Write software for those systems alongside product developers.
➔ Build all additional pieces those systems need, like backups and load balancing.
➔ Reuse old solutions for new problems.
Site Reliability Engineering
![Page 10: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/10.jpg)
DevOps & S.R.E.
DevOps is a practice, which was coined around 2008, that encompasses automation of manual tasks, continuous integration and continuous delivery. It applies to a wide audience of companies whereas SRE might be considered a subset of DevOps that possesses additional skill sets.
Source: https://en.wikipedia.org/wiki/Site_reliability_engineering
![Page 11: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/11.jpg)
Tenets of S.R.E.
![Page 12: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/12.jpg)
Tenets of SRE
1. Ensuring a Durable Focus on Engineering
2. Pursuing Maximum Change Velocity
3. Monitoring
4. Emergency Response
5. Change Management
6. Demand Forecasting and Capacity Planning
7. Provisioning
8. Efficiency and Performance
![Page 13: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/13.jpg)
➔ Hire only coders
➔ Have Service Level Objectives (SLOs) for your service
➔ Measure and report performance against SLOs
➔ Use Error Budgets and gate launches on them
➔ Have a common staffing pool for SRE and DEV
➔ Excess Ops work overflows to DEV team
➔ Cap SRE operational load at 50% and share 5% with the DEV team
➔ On-call teams at least 8 or 6 people in rotation, per product
➔ Maximum of 2 events per on-call shift
➔ Post mortem for every event
➔ Post mortems are BLAMELESS and focus on process and technology, not people
How to achieve S.R.E.
Treynor’s Action items
![Page 14: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/14.jpg)
Reliability & Toil
![Page 15: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/15.jpg)
What is the most important feature of a product?
The latest feature, or that the product works?
![Page 16: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/16.jpg)
How about the “503” feature?
...most important thing is that the product works!
![Page 17: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/17.jpg)
“Reliability is the most fundamental feature of any product.”
Ben Treynor, Google’s VP for 24/7 Operations
![Page 18: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/18.jpg)
The 80’s Waterfall software delivery model
Operations @customer
➔ *Provisioning
➔ *Installing
➔ *Upgrading
➔ *Maintaining
➔ *Backups/Restore
➔ *Scaling
Source: wikipedia
![Page 19: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/19.jpg)
Then came the web...
● Software as a Service
● Platform as a Service
● Cloud computing
● ...
➔ Operations overhead not on the customer side
➔ Features could now be delivered faster
➔ Customer feedback important for product improvements
[Diagram labels: Product, Development, Ship Features, Operations, Users]
![Page 20: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/20.jpg)
Conflicting rewards
Dev objectives:
➔ Ship new features
➔ Launch new products
Ops objectives:
➔ Reliability & Availability
➔ Provision & Scale
![Page 21: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/21.jpg)
The problem: Toil*
*exhausting labour
➔ Manual
➔ Repetitive
➔ Automatable
➔ Tactical (Unplanned work)
➔ No enduring value
➔ O(n) with service growth
(not just “work I don’t like to do.”)
![Page 22: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/22.jpg)
An Old Solution to Toil
● Scale with bodies
In the old operations model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.
![Page 23: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/23.jpg)
As your business succeeds, workload tends to infinity
● Cap Ops Workload
If you are successful and your business grows, you need to reduce errors and toil. Put a 50% cap on Ops work and leave the rest of the SRE team's time for writing code and reducing toil.
[Chart: Workload/Toil over time; x-axis: time, y-axis: customers/traffic]
![Page 24: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/24.jpg)
➔ Keep operational work (i.e., toil) below 50% of each SRE's time
➔ More than 50% of each SRE's time is spent on:
◆ Engineering project work to reduce toil
◆ Adding service features that improve reliability, performance, and utilization
➔ Improves career planning for the SRE
➔ Improves morale in the organization
➔ An SRE team can easily devolve into an Ops team if the 50% target is broken
Why is less Toil better?
S.R.E. - A modern solution
not bad...
![Page 25: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/25.jpg)
S.R.E. - A modern solution
DEV + OPS
➔ This conflict is not inevitable
➔ The solution is: Error Budgets!
➔ Everyone agrees on an Error Budget (as we will explain next)
➔ SRE only prevents releases or launches if the Error Budget is exceeded.
Dev Ops
![Page 26: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/26.jpg)
Error budget
Keeping the SLO
![Page 27: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/27.jpg)
➔ SLO - Service Level Objective is agreed as a means of measuring the performance of the Service Provider.
➔ SLA - Service Level Agreement specifies what service is to be provided, how it is supported, times, locations, costs, performance, and responsibilities of the parties involved. SLOs are specific measurable characteristics of the SLA such as availability, throughput, frequency, response time, or quality.
➔ SLI - Service Level Indicator is a measure of the service level provided by a service provider to a customer. SLIs form the basis of Service Level Objectives (SLOs), which in turn form the basis of Service Level Agreements (SLAs).
SLO, SLA & SLI Terminology
![Page 28: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/28.jpg)
What is an Error Budget?
The business or the product establishes Service Level Objectives (SLOs) for the system, based on Service Level indicators such as error rate, availability or latency...
Error Budget
Example: A 99.9% availability SLO means that the service can be 0.1% unavailable, which is the error budget.
100% - 99.9% = 0.1%
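The arithmetic above can be turned into wall-clock time. The following is a minimal sketch, not part of the original talk; the function names and the 30-day window are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent: float) -> float:
    """Return the error budget as a percentage: 100% minus the SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("SLO must be in (0, 100]")
    return 100.0 - slo_percent


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Translate the budget into unavailability allowed over a window."""
    return window * (error_budget(slo_percent) / 100.0)


# Hypothetical example: a 99.9% availability SLO measured over 30 days.
budget = error_budget(99.9)
downtime = allowed_downtime(99.9, timedelta(days=30))
print(f"budget: {budget:.1f}%")
print(f"allowed downtime in 30 days: {downtime}")
```

Under these assumptions a 99.9% SLO leaves about 43 minutes of unavailability per 30 days to spend.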
![Page 29: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/29.jpg)
➔ 100% is the wrong reliability target for basically everything.
➔ Set a goal that acknowledges the trade-off and leaves an error budget
➔ Error budget can be spent on anything: launching features, etc.
➔ Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors.
➔ Goal of SRE team isn’t “zero outages”; SRE and product devs are incentive-aligned to spend the error budget to get maximum feature velocity.
➔ Out of budget? No problem. Do more testing between releases.
How to obtain the Error Budget
![Page 30: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/30.jpg)
➔ This gives developers an incentive to value stability (not just change)
➔ And gives SREs control that drives them to permit change (not just stability)
➔ It forces decisions based on metrics, not politics or feelings: just data
Error Budget: A self-regulating mechanism
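One way to picture the self-regulating mechanism is a release gate driven only by data. This is a hedged sketch: the `SloWindow` shape, its field names, and the request-based accounting are assumptions for illustration, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    slo_percent: float       # e.g. 99.9
    total_requests: int
    failed_requests: int

    def budget_requests(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return self.total_requests * (100.0 - self.slo_percent) / 100.0

    def budget_remaining(self) -> float:
        """Failures still allowed; negative once the budget is overspent."""
        return self.budget_requests() - self.failed_requests


def release_allowed(window: SloWindow) -> bool:
    """Gate launches on the error budget: block only once it is exceeded."""
    return window.budget_remaining() >= 0


# Hypothetical numbers: the budget is about 1,000 failures per 1,000,000 requests.
healthy = SloWindow(slo_percent=99.9, total_requests=1_000_000, failed_requests=400)
overspent = SloWindow(slo_percent=99.9, total_requests=1_000_000, failed_requests=1_200)
print(release_allowed(healthy), release_allowed(overspent))
```

The gate never asks who wants the release or why; it only compares failures against the agreed budget.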
![Page 31: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/31.jpg)
Development & Operations
![Page 32: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/32.jpg)
➔ Development and SRE teams share a single staffing pool
◆ If all is reliable, Devs are rewarded with teammates
◆ If Ops is overloaded, SREs are contracted to support code
How are Development & Operations teams organized?
Now tell me… Why should I hire you?
![Page 33: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/33.jpg)
Systems, code… Are you able to cook also?
➔ SREs are developer/sys-admin hybrids
◆ They perform more Dev work as things become stable
Development & Operations
![Page 34: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/34.jpg)
➔ SRE can only spend up to 50% of their time on ops work
➔ If operational load exceeds 50%, the ops work overflows to Dev
➔ Allow SRE to move to other projects
Highly motivated and effective teamwork
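The overflow rule can be sketched as a small calculation. The function name, the hour-based accounting, and the example figures are assumptions for illustration; only the 50% cap comes from the slides.

```python
OPS_CAP = 0.5  # from the slides: at most 50% of SRE time goes to ops work


def split_ops_load(ops_hours: float, sre_hours: float, cap: float = OPS_CAP):
    """Return (hours kept by SRE, hours overflowing to Dev) for one period."""
    if ops_hours < 0 or sre_hours <= 0:
        raise ValueError("ops_hours must be >= 0 and sre_hours must be > 0")
    sre_limit = sre_hours * cap          # the most ops work SRE may absorb
    kept = min(ops_hours, sre_limit)
    overflow = ops_hours - kept          # anything above the cap goes to Dev
    return kept, overflow


# Hypothetical week: 200 SRE hours available, 130 hours of ops work arrive.
print(split_ops_load(130, 200))
```

With these numbers SRE keeps 100 hours of ops work and 30 hours overflow to the Dev team.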
![Page 35: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/35.jpg)
Monitoring and Being On-Call
![Page 36: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/36.jpg)
➔ Three valid kinds of monitoring output
◆ Alerts: a human needs to take action immediately
● If you get a huge volume of critical email alerts, disable them and stick with paging
◆ Tickets: a human needs to take action eventually
● On-call engineers can actually accomplish work when they aren’t being kept up by pages at all hours. Ultimately, temporarily backing off on your alerts will allow you to make faster progress toward a better service
◆ Logging: no action needed
Monitoring and taking action
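The three outputs can be expressed as a routing decision. This is a minimal sketch; the `Output` enum and the two boolean inputs are hypothetical names, and real systems would derive them from alerting rules.

```python
from enum import Enum


class Output(Enum):
    ALERT = "page a human now"
    TICKET = "file for later action"
    LOG = "record only"


def route(needs_human: bool, urgent: bool) -> Output:
    """Map an event to one of the three valid monitoring outputs."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG


print(route(needs_human=True, urgent=True).value)
```

Anything that does not fit one of these three branches is noise and should not reach a person.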
![Page 37: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/37.jpg)
➔ Maximum of 2 events per 8–12 hour on-call shift
➔ Handle the event accurately and quickly, clean up and restore normal service
➔ Conduct postmortems
➔ If more than 2 events occur regularly per on-call shift, problems can’t be investigated
➔ Pager fatigue also won’t improve with scale
➔ If they receive fewer than one event per shift, keeping them on point is a waste of their time
Being On-Call
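These limits can be checked against recent shift history. The sketch below uses the average over past shifts as a stand-in for "regularly", which is an assumption, as are the function name and the labels it returns.

```python
MAX_EVENTS_PER_SHIFT = 2  # limit stated on the slide


def shift_health(events_per_shift: list[int]) -> str:
    """Classify recent on-call load from event counts of past shifts."""
    if not events_per_shift:
        raise ValueError("need at least one shift")
    average = sum(events_per_shift) / len(events_per_shift)
    if average > MAX_EVENTS_PER_SHIFT:
        return "overloaded"      # no time left to investigate problems
    if average < 1:
        return "underloaded"     # keeping people on call wastes their time
    return "sustainable"


# Hypothetical event counts for the last three shifts.
print(shift_health([3, 4, 2]))
```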
![Page 38: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/38.jpg)
➔ Monitoring should never require a human to interpret any part of the alerting domain
➔ The four golden signals of monitoring are latency, traffic, errors, and saturation. Start by focusing on these four
“Don’t suggest, expose!”
Dashboards
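As an illustration of the four golden signals, the sketch below summarizes them for one time window. The `Request` record, the nearest-rank 99th percentile, and the use of in-flight requests over capacity as a saturation measure are all assumptions; the slides do not define how each signal is measured.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict[str, float]:
    """Summarize latency, traffic, errors and saturation for one window."""
    if not requests or window_s <= 0 or capacity <= 0:
        raise ValueError("need requests, a positive window and capacity")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), done in integers.
    rank = (99 * len(durations) + 99) // 100
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "saturation": in_flight / capacity,
    }


# Hypothetical two-second window with four requests, one of them failed.
sample = [Request(10, True), Request(20, True), Request(30, False), Request(400, True)]
signals = golden_signals(sample, window_s=2, in_flight=5, capacity=20)
print(signals)
```

A dashboard built on these four numbers exposes the state of the service directly rather than asking a human to infer it.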
![Page 39: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/39.jpg)
➔ An engineer can only react with urgency a few times a day before they get fatigued
➔ Every page should be actionable
➔ Every page response should require intelligence
➔ Pages should be about a new problem or an event that hasn’t been seen before
Pager fatigue
A serious problem to be addressed
![Page 40: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/40.jpg)
Root Cause Analysis: The Core of Problem Solving and Corrective Action
by Duke Okes
https://www.amazon.com/Root-Cause-Analysis-Problem-Corrective/dp/0873897641
Find and eliminate all root causes
![Page 41: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/41.jpg)
➔ When humans are really necessary, thinking through and recording the best practices ahead of time in a playbook or runbook produces roughly a 3x improvement in Mean Time To Repair (MTTR)
➔ SREs write and rely on on-call playbooks/runbooks
Example: http://docs.ansible.com/ansible/playbooks_intro.html
Playbooks/Runbooks
![Page 42: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/42.jpg)
A healthy monitoring and alerting pipeline should be simple and easy to reason about
Monitoring Conclusion
What do I do with this?
➔ Always try to keep a high-level overview of the stack
➔ Even so, monitoring the performance of services like databases often must be performed on the system itself
➔ A dashboard might also be paired with a log, in order to analyze historical correlations rapidly
![Page 43: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/43.jpg)
Release Engineering
![Page 44: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/44.jpg)
➔ All activities in between regular development and delivery of a software product to the end user:
◆ i.e., integration, build, test execution, packaging and delivery of software
➔ “Accelerating the path from development to operations”
➔ A part of the SRE team, to which some more seasoned members are transitioned to conduct this highly important task
➔ Is an internal service
What is Release Engineering?
![Page 45: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/45.jpg)
1. Use version control
2. Use the right build tool(s) for the job
3. Write simple and portable build files
4. Use a release process that is reproducible (CI process)
5. Use a package manager
6. Define the upgrade process before reaching 1.0
7. Create detailed logs of changes made
8. Do “Canary”
9. Keep the big picture in mind
10. Apply these commandments to yourself
10 Commandments of Release Engineering
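Commandment 8 can be illustrated with a simple canary check: send a small share of traffic to the new release and compare its error rate with the current one. The function name, the fixed tolerance, and the numbers are hypothetical; real canary analysis usually involves more signals and statistical care.

```python
def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  tolerance: float = 0.005) -> bool:
    """Return True if the canary's error rate is within tolerance of baseline."""
    if baseline_total <= 0 or canary_total <= 0:
        raise ValueError("both groups need traffic")
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + tolerance


# Hypothetical rollout: 500 requests hit the canary, 10,000 hit the old release.
print(canary_passes(10, 10_000, 1, 500))    # small difference, keep rolling out
print(canary_passes(10, 10_000, 25, 500))   # clear regression, roll back
```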
![Page 46: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/46.jpg)
Collaboration
Developers, SREs and release engineers work together
![Page 47: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/47.jpg)
Postmortem culture
Learning from failure
![Page 48: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/48.jpg)
➔ Document written for ALL significant incidents
➔ Non-paged incidents are even more valuable, as they point to monitoring gaps
➔ Explain what happened in detail
➔ Find all root causes of the event
➔ Assign actions to correct the problem or improve how it is addressed next time
What are Postmortems?
Postmortems?!
![Page 49: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/49.jpg)
Postmortems Are Blameless!
➔ Use a blame-free postmortem culture, with the goal of exposing faults
◆ Apply engineering to fix these faults
◆ Don't just try to avoid or minimize them
![Page 50: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/50.jpg)
Learn and teach with postmortems
Source: http://www.xkcd.com/1495/
![Page 51: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/51.jpg)
SERIOUSLY: BLAMELESS!
The Field Guide to Understanding Human Error
by Sidney Dekker
https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265
![Page 52: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/52.jpg)
Conclusions
![Page 53: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/53.jpg)
The S.R.E. Google Book and more resources
● https://g.co/SREBook
● There is now #SRE on @hangops Slack. https://t.co/btPgSGkGNz to join.
![Page 54: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/54.jpg)
QUESTIONS!
![Page 55: S.R.E - create ultra-scalable and highly reliable systems](https://reader031.fdocuments.net/reader031/viewer/2022021918/58a48ad91a28ab58738b668f/html5/thumbnails/55.jpg)
Evaluate This Session
THANK YOU!
WHAT DID YOU THINK?
We are hiring: https://www.acquia.com/careers/open-positions
https://events.drupal.org/node/13519