Architectural Patterns of Resilient Distributed Systems

74
! OMG Strangeloop 2015!! Architectural Patterns of Resilient Distributed Systems

Transcript of Architectural Patterns of Resilient Distributed Systems

Page 1: Architectural Patterns of Resilient Distributed Systems

! OMG Strangeloop 2015!!

Architectural Patterns of

Resilient Distributed

Systems

Page 2: Architectural Patterns of Resilient Distributed Systems

Ines Sombra

@Randommood [email protected]

Page 3: Architectural Patterns of Resilient Distributed Systems

Globally distributed & highly available

Page 4: Architectural Patterns of Resilient Distributed Systems

Today’s Journey Why care? Resilience literature Resilience in industry Conclusions

@randommood

Page 5: Architectural Patterns of Resilient Distributed Systems

OBLIGATORY DISCLAIMER SLIDE

All from a

practitioner’s perspective!

@randommood

Things you may see in this talk

Pugs Fast talking Life pondering Un-tweetable moments Rantifestos What surprised me this year Wedding factoids and trivia

Page 6: Architectural Patterns of Resilient Distributed Systems

Why Resilience?

Page 7: Architectural Patterns of Resilient Distributed Systems

How can I make a system more

resilient?@randommood

Page 8: Architectural Patterns of Resilient Distributed Systems

@randommood

Resilience is the ability of a system to adapt or

keep working when challenges occur

Page 9: Architectural Patterns of Resilient Distributed Systems

Defining ResilienceFault-tolerance Evolvability Scalability Failure isolation Complexity management

@randommood

Page 10: Architectural Patterns of Resilient Distributed Systems

It’s what really

matters@randommood

Page 11: Architectural Patterns of Resilient Distributed Systems

Resilience in Literature

lll

Page 12: Architectural Patterns of Resilient Distributed Systems

Harvest & Yield Model

Page 13: Architectural Patterns of Resilient Distributed Systems

@randommood

Fraction of successfully answered queries Close to uptime but more useful because it directly maps to user experience (uptime misses this) Focus on yield rather than uptime

Yield

Page 14: Architectural Patterns of Resilient Distributed Systems

@randommoodFrom Coda Hale’s “You can’t sacrifice partition tolerance”

Server A Server B Server C

Baby AnimalsCute

Harvest Fraction of the complete result

Page 15: Architectural Patterns of Resilient Distributed Systems

@randommoodFrom Coda Hale’s “You can’t sacrifice partition tolerance”

Server A Server B Server C

Baby AnimalsCute X66% harvest

Harvest Fraction of the complete result

Page 16: Architectural Patterns of Resilient Distributed Systems

@randommood

#1: Probabilistic Availability

Graceful harvest degradation under faults Randomness to make the worst-case & average-case the same Replication of high-priority data for greater harvest control Degrading results based on client capability

Page 17: Architectural Patterns of Resilient Distributed Systems

@randommood

#2 Decomposition & Orthogonality

Decomposing into subsystems independently intolerant to harvest degradation but the application can continue if they fail You can only provide strong consistency for the subsystems that need it Orthogonal mechanisms (state vs functionality)

Page 18: Architectural Patterns of Resilient Distributed Systems

@randommood

“If your system favors yield or harvest is an

outcome of its design”Fox & Brewer

Page 19: Architectural Patterns of Resilient Distributed Systems

Cook & Rasmussen model

Page 20: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary

Cook & Rasmussen

Operating point

Page 21: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary

Cook & Rasmussen

Page 22: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary Pressure

towards efficiency

Cook & Rasmussen

Page 23: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary Pressure

towards efficiency

Reduction of effort

Cook & Rasmussen

Page 24: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary Pressure

towards efficiency

Reduction of effort

Cook & Rasmussen

Incident!

Page 25: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary Pressure

towards efficiency

Reduction of effort

Safety Campaign

Cook & Rasmussen

Page 26: Architectural Patterns of Resilient Distributed Systems

Economic failure boundary

Unacceptable workload boundary

Accident boundary Pressure

towards efficiency

Reduction of effort

error margin

Marginal boundary

Safety Campaign

Cook & Rasmussen

Page 27: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 28: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 29: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 30: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 31: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 32: Architectural Patterns of Resilient Distributed Systems

error margin

Original marginal boundary

R.I.Cook - 2004

Acceptable operating point

Accident boundary

Flirting with the margin

Page 33: Architectural Patterns of Resilient Distributed Systems

R.I.Cook - 2004

Accident boundary

Flirting with the margin

New marginal boundary!

Page 34: Architectural Patterns of Resilient Distributed Systems

R.I.Cook - 2004

Accident boundary

Flirting with the margin

New marginal boundary!

Page 35: Architectural Patterns of Resilient Distributed Systems

@randommood

Insights from Cook’s model

Engineering resilience requires a model of safety based on: mentoring, responding, adapting, and learning System safety is about what can happen, where the operating point actually is, and what we do under pressure Resilience is operator community focused

Page 36: Architectural Patterns of Resilient Distributed Systems

@randommood

Engineering system resilience

Build support for continuous maintenance Reveal control of system to operators Know it’s going to get moved, replaced, and used in ways you did not intend Think about configurations as interfaces

Page 37: Architectural Patterns of Resilient Distributed Systems
Page 38: Architectural Patterns of Resilient Distributed Systems
Page 39: Architectural Patterns of Resilient Distributed Systems
Page 40: Architectural Patterns of Resilient Distributed Systems
Page 41: Architectural Patterns of Resilient Distributed Systems

Borrill's model

Page 42: Architectural Patterns of Resilient Distributed Systems

Traditional engineering Reactive

ops unk-unk

@randommood

Probability of failure

Rank

A system’s complexity

Cascading or catastrophic failures & you don’t know where they will come from! Same area as other 2 combined

Page 43: Architectural Patterns of Resilient Distributed Systems

Traditional engineering Reactive

ops unk-unk

@randommood

Failure areas need != strategies

Probability of failure

Rank

Page 44: Architectural Patterns of Resilient Distributed Systems

Traditional engineering Reactive

ops unk-unk

@randommood

Failure areas need != strategies

Probability of failure

Rank

Kingsbury

Page 45: Architectural Patterns of Resilient Distributed Systems

Traditional engineering Reactive

ops unk-unk

@randommood

Failure areas need != strategies

Probability of failure

Rank

Kingsbury

VS

Page 46: Architectural Patterns of Resilient Distributed Systems

Traditional engineering Reactive

ops unk-unk

@randommood

Failure areas need != strategies

Probability of failure

Rank

Kingsbury

Alvaro

VS

Page 47: Architectural Patterns of Resilient Distributed Systems

Strategies to build resilience

Code standards

Programming patterns

Testing (full system!)

Metrics & monitoring

Convergence to good state

Hazard inventories

Redundancies

Feature flags

Dark deploys

Runbooks & docs

Canaries

System verification

Formal methods

Fault injection

Classical engineering Reactive Operations Unknown-Unknown

The goal is to build failure domain independence

Page 48: Architectural Patterns of Resilient Distributed Systems

@randommood

“Thinking about building system resilience using a

single discipline is insufficient. We need different strategies”

Borrill

Page 49: Architectural Patterns of Resilient Distributed Systems

Wedding Trivia!!!

@randommood

Page 50: Architectural Patterns of Resilient Distributed Systems

Resilience in Industry

Page 51: Architectural Patterns of Resilient Distributed Systems

@randommood

Now with sparkles! ✨

Page 52: Architectural Patterns of Resilient Distributed Systems

@randommood

API inherently more vulnerable to any system failures or latencies in the stack Without fault tolerance: 30 dependencies w 99.99% uptime could result in 2+ hours of downtime per month! Leveraged client libraries

Page 53: Architectural Patterns of Resilient Distributed Systems

@randommood

Netflix’s resilient patternsAggressive network timeouts & retries. Use of Semaphores. Separate threads on per-dependency thread pools Circuit-breakers to relieve pressure in underlying systems Exceptions cause app to shed load until things are healthy

Page 54: Architectural Patterns of Resilient Distributed Systems

@randommood

We went on a diet just like you! #

Page 55: Architectural Patterns of Resilient Distributed Systems

$

$

Page 56: Architectural Patterns of Resilient Distributed Systems

@randommood

Key insights from Chubby

Library vs service? Service and client library control + storage of small data files with restricted operations Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

Page 57: Architectural Patterns of Resilient Distributed Systems

@randommood

Key insights from Chubby

Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant Restricting user behavior increased resilience Consumers of your service are part of your UNK-UNK scenarios

Page 58: Architectural Patterns of Resilient Distributed Systems

@randommood

And the family arrives!

Page 59: Architectural Patterns of Resilient Distributed Systems

@randommood

Key insights from Truce

Evolution of our purging system from v1 to v3 Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed Design concerns & system evolution

Tyler McMullen Bruce Spang

Page 60: Architectural Patterns of Resilient Distributed Systems

Existing best practices

won’t save you

@randommood

Key insights from NetSys

João Taveira Araújo looking suave

Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk

Page 61: Architectural Patterns of Resilient Distributed Systems

Existing best practices

won’t save you

@randommood

Key insights from NetSys

João Taveira Araújo looking suave

Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk

Page 62: Architectural Patterns of Resilient Distributed Systems

Existing best practices

won’t save you

@randommood

Key insights from NetSys

João Taveira Araújo looking suave

Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches Do immediate or gradual host failure & recovery Watch Joao’s talk

Page 63: Architectural Patterns of Resilient Distributed Systems

@randommood

So we have a myriad of systems with different stages of evolution Resilient systems like Varnish, Powderhorn, and Faild have taught us many lessons but some applications have availability problems, why?

But wait a minute! ♥

Page 64: Architectural Patterns of Resilient Distributed Systems

@randommood

Everyone okay?

Page 65: Architectural Patterns of Resilient Distributed Systems

Resilient architectural

patterns

Page 66: Architectural Patterns of Resilient Distributed Systems

@randommood

Redundancies are key

Redundancies of resources, execution paths, checks, replication of data, replay of messages, anti-entropy build resilience Gossip / epidemic protocols too Capacity planning matters

Optimizations can make your

system less resilient!

Page 67: Architectural Patterns of Resilient Distributed Systems

@randommood

Unawareness of proximity to error boundary means we are always guessing Complex operations make systems less resilient & more incident-prone You design operability too!

Operations matter

Page 68: Architectural Patterns of Resilient Distributed Systems

@randommood

Complexity if increases safety is actually good Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost, etc)

Not all complexity is bad

Page 69: Architectural Patterns of Resilient Distributed Systems

@randommood

Leverage Engineering best practices

Resiliency and testing are correlated. TEST! Versioning from the start - provide an upgrade path from day 1 Upgrades & evolvability of systems is still tricky. Mixed-mode operations need to be common Re-examine the way we prototype systems

Page 70: Architectural Patterns of Resilient Distributed Systems

Bringing it together♥

Page 71: Architectural Patterns of Resilient Distributed Systems

tl;drOPERABILITYWHILE IN DESIGN UNK-UNK

Are we favoring harvest or yield?Orthogonality & decomposition FTW Do we have enough redundancies in place? Are we resilient to our dependencies?

Am I providing enough control to my operators? Would I want to be on call for this? Rank your services: what can be dropped, killed, deferred? Monitoring and alerting in place?

The existence of this stresses diligence on the other two areas Have we done everything we can? Abandon hope and resort to human sacrifices

♥ ♥

Theory matters!

Page 72: Architectural Patterns of Resilient Distributed Systems

IMPROVING OPERABILITYWHILE IN DESIGNTest dependency failures Code reviews != tests. Have both Distrust client behavior, even if they are internal Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations. Checksum all the things Error handling, circuit breakers, backpressure, leases, timeouts

Automation shortcuts taken while in a rush will come back to haunt you Release stability is o"en tied to system stability. Iron out your deploy process Link alerts to playbooks Consolidate system configuration (data bags, config file, etc)

tl;dr♥ ♥

Operators determine resilience

Page 73: Architectural Patterns of Resilient Distributed Systems

@randommood

We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign

the moment we finish coding.

TODAY’S RANTIFESTO♥ ♥

Page 74: Architectural Patterns of Resilient Distributed Systems

Thank you!github.com/Randommood/Strangeloop2015

77

Special thanks to

Paul Borrill, Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgard, Peter Alvaro, Peter Bailis, Bruce Spang, Matt Whiteley, Alex Rasmussen, Aysulu Greenberg, Elaine Greenberg, and Greg Bako.