SRE at Airbnb

Cameron Tuckerman-Lee / DevOpsDays Shanghai / 2017-08-18


Cameron Tuckerman-Lee, Airbnb Site Reliability Engineer

DevOps & SRE: How do you combine the culture and spirit of DevOps with an operations team?

SRE Organization: How is SRE at Airbnb organized? Cloud Infra and Reliability deep-dive.

Future of Ops: Operators should grow, learn, and be recognized for on-call work, while maintaining pager-life balance.

DevOps & SRE

Centralized Ops

Positives

Reliability can be easily prioritized

Specialization of roles

Negatives

Operators unfamiliar with codebase

Tension between operations and development

Centralized Operations Organization

Distributed Ops

Positives

Agility can be easily prioritized

Developers are incentivized to build systems that are easy to operate (since they are the operators!)

Negatives

Lack of specialization: devs are forced to relearn difficult lessons over and over

Teams speak different uptime/reliability languages to each other

Distributed Operations

Hybrid Approach

Able to 'tune' a balance between reliability and agility

Developers are still expected to run normal operations for their services, which means building operable services

A centralized operations organization can build reusable tools to make operations and incident response easier.

Specialization of roles without tension between operations and development teams.

An organization that understands and recognizes the value in automating away its own job.

Hybrid Approach: Two Pizza Teams + SRE Team

"Fundamentally, it's what happens when you ask a software engineer to design an operations function..."

Ben Treynor, VP Engineering, Google

SRE Organization

What makes up SRE at Airbnb?

Site Reliability Engineering is made up of three components:

Cloud Infrastructure: Manages our touchpoints with AWS and other cloud partners

Core Reliability: Develops tools and processes to improve operations, reliability, and incident response for all teams

Embedded Reliability: Temporary embedding of SREs in product teams to work on specific reliability- or availability-focused projects

Cloud Infrastructure

Requirements for Each Integration

Monitoring

Alerting

Security Approval

Auditing

Version Upgrades

Access Control

...

Reliability

Three Pillars of Reliability

Uptime Measurement: Every team at any time should be able to confidently say whether their service is working properly or not.

Alerting & Detection: Defense in depth. Our users are protected from bugs and regressions by multiple layers of opinionated alerts.

Incident Response: Engineers can coordinate across teams, investigate problems in systems they don't fully understand, and keep stakeholders up-to-date.

1. Uptime

Identify quantifiable metrics related to the health of your services, called Service Level Indicators (SLIs)

Make public and easily discoverable promises about the behavior of your service using your SLIs, called Service Level Objectives (SLOs)

Teams review their services' current SLIs and compare them to their published SLOs to make tradeoffs between reliability improvements and new features. SLOs encode the tradeoff between moving fast and breaking things (error budgets)
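
To make the error-budget idea concrete, here is a minimal Python sketch. It is not Airbnb's tooling: it assumes a request-based availability SLI (good requests divided by total requests) and a hypothetical 99.9% SLO, and the names `Slo`, `availability_sli`, and `error_budget_remaining` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A published promise about a service, expressed against an SLI."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the review window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    slo = Slo(name="search-availability", target=0.999)  # hypothetical service
    good, total = 999_400, 1_000_000                     # made-up traffic numbers
    remaining = error_budget_remaining(slo, good, total)
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Error budget remaining: {remaining:.0%}")
    # Budget left: room to ship features. Budget spent: prioritize reliability.
    print("ship features" if remaining > 0 else "prioritize reliability")
```

With budget left over, a team can favor new features; once the budget is overspent, the same number argues for reliability work.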

2. Alerting

Alerting philosophy should be opinionated: engineers know what kind of alerts to write and when to write them

Alerts (like configuration) should be code

Practice defense in depth: protect your users from bugs and regressions with layers of alerts, like a security team protects employees from being compromised with layers of defenses
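
One way to read "alerts should be code" together with "defense in depth" is sketched below in Python. This is an illustrative assumption, not Airbnb's alerting framework: the `Alert` type, the layer names, the metric names, and the thresholds are all hypothetical. Because the alerts are plain data in a reviewed file, a check such as `missing_layers` could run in CI.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical layers, ordered from closest to the user to closest to the host.
REQUIRED_LAYERS = ("user_facing", "service", "infrastructure")


@dataclass(frozen=True)
class Alert:
    name: str
    layer: str        # one of REQUIRED_LAYERS
    metric: str       # hypothetical metric name
    threshold: float  # fire when the metric value exceeds this
    page: bool        # True pages the on-call, False files a ticket


ALERTS: List[Alert] = [
    Alert("checkout_error_rate", "user_facing", "checkout.errors.ratio", 0.01, True),
    Alert("payments_p99_latency", "service", "payments.latency.p99_ms", 800.0, True),
    Alert("disk_usage", "infrastructure", "host.disk.used_ratio", 0.90, False),
]


def missing_layers(alerts: List[Alert]) -> Set[str]:
    """Return the required layers that have no alert defined."""
    return set(REQUIRED_LAYERS) - {a.layer for a in alerts}


def firing(alerts: List[Alert], metrics: Dict[str, float]) -> List[str]:
    """Names of alerts whose metric is present and above its threshold."""
    return [a.name for a in alerts
            if metrics.get(a.metric, float("-inf")) > a.threshold]


if __name__ == "__main__":
    assert not missing_layers(ALERTS), "every layer needs at least one alert"
    sample = {"checkout.errors.ratio": 0.03, "host.disk.used_ratio": 0.50}
    print(firing(ALERTS, sample))
```

Keeping the layer on every alert is what makes the "layers of defenses" idea checkable: a service with only host-level alerts fails the coverage check before it reaches production.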

3. Response

Incident Reporter Tool

Mid-Incident

Engineers can effectively coordinate, even across teams

Stakeholders (upstream clients, management, employees) are kept aware of updates

Working on a Slack integration so responders can stay in chat but keep the company up-to-date

Post-Incident

Blameless postmortem process

Consistent impact measurement (management sees that better incident response + corrective actions matter to the bottom line)

Easily search past incidents/postmortems
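
The slides describe the Incident Reporter Tool only at a high level, so the following Python sketch is a guess at the kind of record that would support consistent impact measurement and searchable postmortems. The `Incident` fields, the "user-minutes lost" metric, and the `search` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    title: str
    started: datetime
    resolved: datetime
    affected_users: int                  # hypothetical impact field
    teams: List[str] = field(default_factory=list)
    postmortem: str = ""                 # blameless write-up: what happened, not who

    @property
    def duration(self) -> timedelta:
        return self.resolved - self.started

    @property
    def user_minutes_lost(self) -> float:
        """One consistent impact number: affected users times minutes of impact."""
        return self.affected_users * self.duration.total_seconds() / 60


def search(incidents: List[Incident], term: str) -> List[Incident]:
    """Case-insensitive search over titles and postmortem text."""
    needle = term.lower()
    return [i for i in incidents
            if needle in i.title.lower() or needle in i.postmortem.lower()]


if __name__ == "__main__":
    history = [
        Incident("Search latency spike", datetime(2017, 8, 1, 3, 0),
                 datetime(2017, 8, 1, 3, 30), affected_users=1000,
                 teams=["search"], postmortem="Cache eviction storm after deploy."),
        Incident("Payments errors", datetime(2017, 8, 5, 12, 0),
                 datetime(2017, 8, 5, 12, 10), affected_users=200,
                 teams=["payments"], postmortem="Expired certificate."),
    ]
    for incident in search(history, "cache"):
        print(incident.title, incident.user_minutes_lost)
```

Measuring every incident in the same unit is what lets management compare incidents and see whether corrective actions reduce impact over time.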

Future of Ops

Pager-Life Balance: Ensure that more involved, tenured engineers aren't always the ones waking up at 3 AM to put out fires

Learning/Growth Focused: Continuing education and learning opportunities for on-call engineers

Evaluation Metrics: Engineers should know where they can improve and should be recognized for excellent work

Intelligent Scheduling: In DevOps, when every team has at least two on-call rotations, how can we schedule around lives outside of work (and responsibilities inside of work)?

People-First On-call
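
The talk poses intelligent scheduling as an open question. As one possible starting point, the Python sketch below uses a simple greedy rule: each week goes to the available engineer with the fewest shifts so far, avoiding back-to-back weeks when possible. The `schedule` function, the weekly granularity, and the example names are assumptions, not a description of how Airbnb schedules on-call.

```python
from typing import Dict, List, Optional, Set


def schedule(engineers: List[str], weeks: int,
             unavailable: Dict[str, Set[int]]) -> List[Optional[str]]:
    """Greedy rotation that respects time off and spreads load evenly.

    unavailable maps an engineer to the week indexes they cannot cover.
    A week with nobody available is left as None for a human to resolve.
    """
    load = {e: 0 for e in engineers}
    rota: List[Optional[str]] = []
    for week in range(weeks):
        candidates = [e for e in engineers
                      if week not in unavailable.get(e, set())]
        previous = rota[-1] if rota else None
        # Prefer someone who was not on call last week, if anyone qualifies.
        rested = [e for e in candidates if e != previous]
        pool = rested or candidates
        if not pool:
            rota.append(None)
            continue
        chosen = min(pool, key=lambda e: load[e])  # ties go to list order
        load[chosen] += 1
        rota.append(chosen)
    return rota


if __name__ == "__main__":
    team = ["ana", "bo", "chen"]                 # hypothetical engineers
    time_off = {"ana": {0}, "bo": {3}}           # weeks each person is away
    for week, who in enumerate(schedule(team, 6, time_off)):
        print(f"week {week}: {who}")
```

A real scheduler would also need to account for shift swaps, time zones, and the second rotation mentioned above; the greedy rule only shows that time off and fairness can be encoded as explicit inputs.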
