SRE at Airbnb
DevOps & SRE | SRE Organization | Future of Ops
How do you combine the culture and spirit of DevOps with an operations team?
How is SRE at Airbnb organized? Cloud Infra and Reliability deep-dive.
Operators should grow, learn, and be recognized for on-call work, while maintaining pager-life balance.
Centralized Ops
Positives
Reliability can be easily prioritized
Specialization of roles
Negatives
Operators unfamiliar with codebase
Tension between operations and development
Centralized Operations Organization
Distributed Ops
Positives
Agility can be easily prioritized
Developers are incentivized to build systems that are easy to operate (since they are the operators!)
Negatives
Lack of specialization: devs are forced to relearn difficult lessons over and over
Teams speak different uptime/reliability languages to each other
Distributed Operations
Hybrid Approach
Able to 'tune' a balance between reliability and agility
Developers are still expected to run normal operations for their services, so they build operable services
A centralized operations organization can build reusable tools to make operations and incident response easier.
Specialization of roles without tension between operations and development teams.
An organization that understands and recognizes the value in automating away its own job.
Hybrid Approach: Two-Pizza Teams + SRE Team
Ben Treynor, VP Engineering, Google
"Fundamentally, it's what happens when you ask a software engineer to design an operations function..."
What makes up SRE at Airbnb?
Site Reliability Engineering is made up of three components:
Cloud Infrastructure: Manages our touchpoints with AWS and other cloud partners
Core Reliability: Develops tools and processes to improve operations, reliability, and incident response for all teams
Embedded Reliability: Temporary embedding of SREs in product teams to work on specific reliability- or availability-focused projects
Requirements for Each Integration
Monitoring
Alerting
Security Approval
Auditing
Version Upgrades
Access Control
...
Three Pillars of Reliability
Uptime Measurement: Every team at any time should be able to confidently say whether their service is working properly or not.
Alerting & Detection: Defense-in-depth. Our users are protected from bugs and regressions by multiple layers of opinionated alerts.
Incident Response: Engineers can coordinate across teams, investigate problems in systems they don't fully understand, and keep stakeholders up-to-date.
1. Uptime
Identify quantifiable metrics related to the health of your services, called Service Level Indicators (SLIs)
Make public and easily discoverable promises about the behavior of your service using your SLIs, called Service Level Objectives (SLOs)
Teams review their services' current SLIs and compare them to their published SLOs to make tradeoffs between reliability improvements and new features. SLOs encode the tradeoff between moving fast and breaking things (error budgets)
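The relationship between an SLI, an SLO, and an error budget can be made concrete with a little arithmetic. The following is a minimal sketch, not Airbnb's tooling: the `SLO` class, the function names, and the example service name are all assumptions made for illustration. It treats the SLI as the ratio of good events to total events and the error budget as the number of bad events the SLO still permits.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A published objective for one SLI (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def can_ship_features(slo: SLO, good_events: int, total_events: int) -> bool:
    """Budget left over favors new features; an overspent budget favors reliability work."""
    return error_budget_remaining(slo, good_events, total_events) > 0.0


# Hypothetical service: 1,000,000 requests, 500 of them failed.
example = SLO(name="checkout-availability", target=0.999)
print(error_budget_remaining(example, 999_500, 1_000_000))
```

With a 99.9% target, one million requests allow about 1,000 failures, so 500 failures leave roughly half the budget, while 2,000 failures would overspend it and argue for reliability work before new features.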
2. Alerting
Alerting philosophy should be opinionated: engineers know what kind of alerts to write and when to write them
Alerts (like configuration) should be code
Practice defense in depth: protect your users from bugs and regressions with layers of alerts, like a security team protects employees from being compromised with layers of defenses
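One way to picture "alerts as code" with layered defenses is to declare each alert as data plus a condition, keep the declarations in version control, and evaluate them against current metrics. The sketch below is an assumption-laden illustration, not the framework used at Airbnb: the `Alert` class, the layer names, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Alert:
    name: str
    layer: str                            # which layer of defense this alert belongs to
    condition: Callable[[Metrics], bool]  # True means the alert should fire


# Hypothetical alerts at two layers: service-level symptoms and a
# business-level signal that catches regressions the first layer misses.
ALERTS: List[Alert] = [
    Alert("high_5xx_rate", "service",
          lambda m: m.get("http_5xx_rate", 0.0) > 0.01),
    Alert("slow_p99_latency", "service",
          lambda m: m.get("p99_latency_ms", 0.0) > 800),
    Alert("conversions_dropped", "business",
          lambda m: m.get("conversions_per_min", float("inf")) < 50),
]


def firing(alerts: List[Alert], metrics: Metrics) -> List[str]:
    """Return the names of all alerts whose condition holds for these metrics."""
    return [a.name for a in alerts if a.condition(metrics)]


# A bug that returns HTTP 200 with broken content: the service layer stays
# quiet, but the business layer still notices the drop.
print(firing(ALERTS, {"http_5xx_rate": 0.002,
                      "p99_latency_ms": 300,
                      "conversions_per_min": 10}))
```

Because the alerts are ordinary code, they can be reviewed, tested, and reused across teams like any other change.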
3. Response
Incident Reporter Tool
Mid-Incident
Engineers can effectively coordinate, even across teams
Stakeholders (upstream clients, management, employees) are kept aware of updates
Working on a Slack integration so responders can stay in chat but keep the company up-to-date
Post-Incident
Blameless postmortem process
Consistent impact measurement (management sees that better incident response and corrective actions matter to the bottom line)
Easily search past incidents/postmortems
Future of Ops
Pager-Life Balance: Ensure that more involved, tenured engineers aren't always the ones waking up at 3 AM to put out fires
Learning/Growth Focused: Continuing education and learning opportunities for on-call engineers
Evaluation Metrics: Engineers should know where they can improve and should be recognized for excellent work
Intelligent Scheduling: In DevOps, when every team has at least two on-call rotations, how can we schedule around lives outside of work (and responsibilities inside of work)?
People-First On-call
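The scheduling question above is left open in the talk. As one possible starting point, and purely as an assumption rather than anything Airbnb has described, a scheduler could skip days an engineer has marked unavailable and always pick the available person with the fewest shifts so far, so that load does not pile up on the same people. The function name and the input shapes below are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Set


def schedule_on_call(engineers: List[str], start: date, days: int,
                     unavailable: Dict[str, Set[date]]) -> Dict[date, str]:
    """Assign one engineer per day, respecting unavailability and balancing load."""
    load = {e: 0 for e in engineers}
    schedule: Dict[date, str] = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        candidates = [e for e in engineers if day not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"nobody is available on {day}")
        # min() returns the first minimum, so ties go to whoever is listed first.
        chosen = min(candidates, key=lambda e: load[e])
        load[chosen] += 1
        schedule[day] = chosen
    return schedule


# Hypothetical rotation of three engineers over six days.
plan = schedule_on_call(
    ["ana", "bo", "cy"],
    start=date(2017, 8, 14),
    days=6,
    unavailable={"ana": {date(2017, 8, 14)}, "bo": {date(2017, 8, 15)}},
)
for day, person in plan.items():
    print(day.isoformat(), person)
```

A real scheduler would also have to weigh time zones, consecutive nights, and in-work responsibilities, which this greedy sketch ignores.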
DevOpsDays 2017 · Shanghai (August 18)