Oscon2015 150724001540-lva1-app6891

Building A Successful Organization By Mastering Failure John Goulah (@johngoulah) Etsy


Page 1: Oscon2015 150724001540-lva1-app6891

Building A Successful Organization By Mastering Failure

John Goulah (@johngoulah)

Etsy

Page 2: Oscon2015 150724001540-lva1-app6891
Page 3: Oscon2015 150724001540-lva1-app6891

Marketplace

• $1.93B Annual GMS 2014

• 1.4M active sellers

• 20M+ active buyers

• 30% international GMS

• 57%+ mobile visits

Page 4: Oscon2015 150724001540-lva1-app6891

Infrastructure

• over 5500 MySQL databases

• 750K graphite metrics/min

• 1.3GB logs written/min

• 50M - 75M gearman jobs / day

• 30-50 deploys / day

Page 5: Oscon2015 150724001540-lva1-app6891

Company

• Headquartered in Brooklyn

• Over 700 employees

• 7 offices around the world

• 80+ dogs / 80+ cats

Page 6: Oscon2015 150724001540-lva1-app6891

Values

Page 7: Oscon2015 150724001540-lva1-app6891
Page 8: Oscon2015 150724001540-lva1-app6891

Learning Org

a company that facilitates the learning of its members and continuously transforms itself

Page 9: Oscon2015 150724001540-lva1-app6891

Five Disciplines

Page 10: Oscon2015 150724001540-lva1-app6891

Systems Thinking

the process of understanding how people, structure, and processes influence one another within a larger system

Page 11: Oscon2015 150724001540-lva1-app6891

Personal Mastery

an individual holds great importance in a learning organization

Page 12: Oscon2015 150724001540-lva1-app6891

Mental Models

the assumptions held by individuals and organizations

Page 13: Oscon2015 150724001540-lva1-app6891

Shared Vision

creates a common identity that provides focus and energy for learning

Page 14: Oscon2015 150724001540-lva1-app6891

Team Learning

the problem solving capacity of the organization is improved through better access to knowledge and expertise

Page 15: Oscon2015 150724001540-lva1-app6891

Learning About Failure

• architecture reviews

• operability reviews

• blameless post mortems

Page 16: Oscon2015 150724001540-lva1-app6891

failure and success come from the same source

Page 17: Oscon2015 150724001540-lva1-app6891

context

Page 18: Oscon2015 150724001540-lva1-app6891

can study the system at any time

Page 19: Oscon2015 150724001540-lva1-app6891

inflection points

• architecture reviews

  • early feedback and discussion

• operability reviews

  • held before launching

• blameless post mortems

  • held after a failure

Page 20: Oscon2015 150724001540-lva1-app6891

Architecture Reviews

Page 21: Oscon2015 150724001540-lva1-app6891

Architecture Reviews

understand the costs and benefits of a proposed solution, and discuss alternatives

Page 22: Oscon2015 150724001540-lva1-app6891

Etsy Tech Axioms

• we use a small number of well known tools

• all technology decisions come with trade offs

• with new technology, many of those trade offs are unknown

• we’re growing. things change

Page 23: Oscon2015 150724001540-lva1-app6891

with new technology, many of those trade offs are unknown

Page 24: Oscon2015 150724001540-lva1-app6891

Departures

a departure is when new technologies or patterns are introduced that deviate from the current known methods of operating the system and maintaining the software

Page 25: Oscon2015 150724001540-lva1-app6891

How do I know I need an architecture review?

when there is a perceived departure from current technology choices or patterns

Page 26: Oscon2015 150724001540-lva1-app6891

How early do you hold them?

early enough to be able to bail out or make major course corrections

Page 27: Oscon2015 150724001540-lva1-app6891

Who should come?

• the people presenting the change

• key stakeholders (sr. engineers, or arch review working group)

• everyone else that wants to learn about the proposed changes to the system

Page 28: Oscon2015 150724001540-lva1-app6891

Architecture Review Meeting Format

Page 29: Oscon2015 150724001540-lva1-app6891

Preparation

• a proposal is written in a shared document and circulated

• comments are added, discussed, and potentially resolved in advance

• initial questions for the meeting are collected in a tool such as Google Moderator

Page 30: Oscon2015 150724001540-lva1-app6891

Some General Questions

• Do we understand the costs of this departure?

• Have we asked hard questions about trade-offs?

• What will this prohibit us from doing in the future?

Page 31: Oscon2015 150724001540-lva1-app6891

Some General Questions (cont)

• Are we impacting visibility, measurability, debuggability and other operability concerns?

• Are we impacting testability, security, translatability, performance and other product quality concerns?

• Does it make sense?

Page 32: Oscon2015 150724001540-lva1-app6891

The Arch Review

• proposal is presented to the group

• discuss questions and concerns

• decide if we are moving forward or need further discussion

Page 33: Oscon2015 150724001540-lva1-app6891

you're saying my project might not move forward?

Page 34: Oscon2015 150724001540-lva1-app6891

Why might this end a project?

• we learned through this discussion that an alternative is better

• we find goals overlap with other projects that are in progress

• we discover that it isn't worth the costs now that we have a better idea what they are

Page 35: Oscon2015 150724001540-lva1-app6891

At the end we should have

• detailed notes from the conversation

• documented agreement on tricky components

• a compilation of learnings and questions

• a decision of whether to keep going with the project, stop and rethink, or gather more information

Page 36: Oscon2015 150724001540-lva1-app6891

Operability Reviews

Page 37: Oscon2015 150724001540-lva1-app6891

Operability Reviews

understand how the system could break, how we will know, and how we will react

Page 38: Oscon2015 150724001540-lva1-app6891

When do we do operability reviews?

• after architecture reviews in the product lifecycle, generally right before launch

• when we need to gain increased confidence for launch due to the technology, product, or communication choices being risky

• if there's a chance you'd surprise teams that operate the software

Page 39: Oscon2015 150724001540-lva1-app6891

Who comes to the operability review?

representatives from:

• Product

• Development

• Operations

• Community/Support

• QA

Page 40: Oscon2015 150724001540-lva1-app6891

Some Questions

• Has the feature been tested enough to deploy to production?

• Does everyone know when it will go live, and who will push the feature?

• Is there communication about the feature ready to go out with the feature?

• Is it possible to turn up this feature on a percentage basis, dark launch, or gameday it?

Page 41: Oscon2015 150724001540-lva1-app6891

Some Questions (cont)

• Does the launch involve any new production infrastructure?

• If so, are those pieces in monitoring or metrics collection?

• If so, is there a deployment pipeline in place?

• If so, is there a development environment set up to make it work in dev?

• If so, are there tests that can be and are run on CI?
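The question about turning a feature up on a percentage basis can be sketched as follows. This is a minimal illustration, not Etsy's actual feature-flag tooling: the function name `in_rollout`, the feature name, and the hash-based bucketing scheme are all assumptions made for the example.

```python
import hashlib


def in_rollout(feature: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hypothetical helper: hashing feature + user gives each user a stable
    bucket per feature, so raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # e.g. percent=1.5 covers buckets 0..149


if __name__ == "__main__":
    # Ramp a hypothetical feature from 1% to 50% and count exposed users.
    users = [f"user-{n}" for n in range(1000)]
    for pct in (1, 10, 50):
        exposed = sum(in_rollout("new_checkout", u, pct) for u in users)
        print(f"{pct}% ramp: {exposed} of {len(users)} users see the feature")
```

Because the bucket depends only on the feature and user, a user who sees the feature at 10% still sees it at 50%, which keeps the ramp-up predictable for the teams watching it.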

Page 42: Oscon2015 150724001540-lva1-app6891

Contingency Checklist

Page 43: Oscon2015 150724001540-lva1-app6891

Contingency Checklist

a list of things that could possibly go "wrong" with a new feature, and what we could do about it

Page 44: Oscon2015 150724001540-lva1-app6891

Issue

What could possibly go wrong with the feature launched in production?

Page 45: Oscon2015 150724001540-lva1-app6891

Likelihood

What is the likelihood of each item going wrong?

Page 46: Oscon2015 150724001540-lva1-app6891

Comments

Any comments about the item?

Page 47: Oscon2015 150724001540-lva1-app6891

Impact

This is just a measure of how impactful this will be if it does actually turn out to be a concern.

Page 48: Oscon2015 150724001540-lva1-app6891

Engineering

What do we do to mitigate the issue with the item (e.g. can we gracefully degrade)?

Page 49: Oscon2015 150724001540-lva1-app6891

Onsite Messaging

What is the messaging to the user in the forums, blog, and social media if this needs graceful degradation?

Page 50: Oscon2015 150724001540-lva1-app6891

PR

Is PR needed for the contingency (e.g. a larger scale failure)?
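The checklist columns above (Issue, Likelihood, Comments, Impact, Engineering, Onsite Messaging, PR) can be sketched as a simple record. The slides do not say how likelihood and impact are scored, so the low/medium/high scale and the likelihood-times-impact ordering below are assumptions, as are all the names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale; the slides do not define one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ContingencyItem:
    issue: str                  # what could go wrong in production
    likelihood: Level           # how likely it is to go wrong
    impact: Level               # how bad it is if it does
    comments: str = ""          # any comments about the item
    engineering: str = ""       # mitigation, e.g. graceful degradation
    onsite_messaging: str = ""  # what users are told (forums, blog, social)
    pr_needed: bool = False     # whether PR is needed for the contingency


def review_order(items):
    """Sort so that likely, high-impact items are discussed first (assumed heuristic)."""
    return sorted(items, key=lambda i: (i.likelihood * i.impact, i.impact), reverse=True)


if __name__ == "__main__":
    # Hypothetical checklist entries for illustration only.
    checklist = [
        ContingencyItem("search index lags behind writes", Level.MEDIUM, Level.LOW,
                        engineering="serve slightly stale results"),
        ContingencyItem("new service overloads the database", Level.LOW, Level.HIGH,
                        engineering="turn the feature off via config", pr_needed=True),
    ]
    for item in review_order(checklist):
        print(f"{item.likelihood.name:<6} {item.impact.name:<6} {item.issue}")
```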

Page 51: Oscon2015 150724001540-lva1-app6891

Blameless Post Mortems

Page 52: Oscon2015 150724001540-lva1-app6891

What is a post mortem?

a postmortem is a facilitated meeting during which people involved in, interested in, or close to an accident or incident debrief together on how we think the event came about

Page 53: Oscon2015 150724001540-lva1-app6891

What does it cover?

• walking through a timeline of events

• learning how things are expected to work "normally", adding the context of everyone’s perspective

• exploring what we might do to improve things for the future

Page 54: Oscon2015 150724001540-lva1-app6891

Local Rationality

we want to know how it made sense for someone to do what they did at the time

Page 55: Oscon2015 150724001540-lva1-app6891

searching for second stories instead of human error

• asking "why" leads to "who is responsible"

• asking "how" leads to "what"

Page 56: Oscon2015 150724001540-lva1-app6891

Avoiding Human Error

Human error points directly to individuals in a complex system. But, in complex systems, system behaviour is driven fundamentally by the goals of the system and the system structure. People just provide the flexibility to make it work.

Page 57: Oscon2015 150724001540-lva1-app6891

Avoiding Human Error (cont)

Human error implies deviation from "normal" or "ideal", but in complex situations and tasks there is often no normal or ideal that can be precisely described. Many variable, interconnected touchpoints influence the decisions that are made.

Page 58: Oscon2015 150724001540-lva1-app6891

Recognizing Human Error

• be aware of other terms for it: slip, lapse, distraction, mistake, deviation, carelessness, malpractice, recklessness, violation, misjudgement, etc

• don’t point to individuals when you really want to understand the system itself and the work

• how do you feel when something goes wrong?

• is your instinct to find who did it / who screwed up, or to find how it happened?

Page 59: Oscon2015 150724001540-lva1-app6891

Other Things to Avoid

Page 60: Oscon2015 150724001540-lva1-app6891

Root Cause

• it leads to a simplistic and linear explanation of how events transpired

• linear mental models of causality don’t capture what is needed to improve the safety of a system

• ignores the complexity of an event, which is what should be explored if we are going to learn

• leads directly to blaming things on human error

Page 61: Oscon2015 150724001540-lva1-app6891

Nietzschean anxiety

when situations appear both threatening and ambiguous we seem to demand a clear causal agency; because if we cannot establish this agency then the "problem" is potentially irresolvable

Page 62: Oscon2015 150724001540-lva1-app6891

Hindsight Bias

the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it

Page 63: Oscon2015 150724001540-lva1-app6891

Counterfactuals

the human tendency to create possible alternatives to life events that have already occurred; something that is contrary to what actually happened

Page 64: Oscon2015 150724001540-lva1-app6891

Morgue

https://github.com/etsy/morgue

Page 65: Oscon2015 150724001540-lva1-app6891

Post Mortem Meeting Format

Page 66: Oscon2015 150724001540-lva1-app6891

Meeting Format

• Timeline

• Discussion

• Remediation Items

Page 67: Oscon2015 150724001540-lva1-app6891

Timeline

• a rough timeline scaffolding is required

• talk about facts that were known at the time, even if hindsight reveals misunderstandings in what we knew

• look out for knowledge that some people were aware of, that others were not, and dig into that

• no judgement about actions or knowledge (counterfactuals)

• tell people to hold that thought if they jump to remediation items at this point

Page 68: Oscon2015 150724001540-lva1-app6891

Timeline (cont)

• continually ask "What are we missing?" until those involved feel it's complete

• continually ask "Does everyone agree this is the order in which events took place?"

• make sure to include important times for events that happened (alerts, discoveries)

• reach a consensus on the timeline and move on to the discussion
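The timeline steps above (order the events, record important times, look for knowledge some people had and others did not) can be sketched as a small data structure. This is an illustration only; it does not reflect the schema of Morgue or any other tool, and the names `TimelineEvent`, `build_timeline`, and `knowledge_gaps` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime                        # when it happened
    kind: str                           # e.g. "alert", "discovery", "action"
    description: str                    # the fact as known at the time
    known_to: frozenset = frozenset()   # who was aware of it during the incident


def build_timeline(events):
    """Return events in the order they took place."""
    return sorted(events, key=lambda e: e.at)


def knowledge_gaps(events, participants):
    """Events that some participants knew about and others did not."""
    everyone = frozenset(participants)
    return [e for e in events if e.known_to and e.known_to < everyone]


if __name__ == "__main__":
    # Hypothetical incident, with made-up names and times.
    people = ["ann", "bob"]
    events = [
        TimelineEvent(datetime(2015, 7, 24, 10, 12), "discovery",
                      "replica lag noticed on a dashboard", frozenset({"ann"})),
        TimelineEvent(datetime(2015, 7, 24, 10, 5), "alert",
                      "error-rate alert fired", frozenset({"ann", "bob"})),
    ]
    for e in build_timeline(events):
        print(e.at.strftime("%H:%M"), e.kind, "-", e.description)
    for e in knowledge_gaps(events, people):
        print("dig into:", e.description)
```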

Page 69: Oscon2015 150724001540-lva1-app6891

Discussion

• When an action or decision was taken in the timeline, ask the person: "Think back to what you knew at the time. Why did that action make sense to you then?"

• Did we clean up anything after we were stable? How long did it take?

• Was there any troubleshooting fatigue?

Page 70: Oscon2015 150724001540-lva1-app6891

Discussion (cont)

• Did we do a good job with communication (site status, support, forums, etc)?

• Were all tools on hand and working, ready to use when we needed them during the issue? Were there tools we would have liked to have?

• Did we have enough metrics visibility to diagnose the issue?

• Was there collaborative and thoughtful communication during the issue?

Page 71: Oscon2015 150724001540-lva1-app6891

Remediation

• Remediation items should have tickets associated with them to follow up on

• There can be further post-meeting discussion on these, but tasks should not linger

Page 72: Oscon2015 150724001540-lva1-app6891

Remediation questions

• What things could we do to prevent this exact thing from happening in the future?

• What things could we do to make troubleshooting similar incidents in the future easier?

Page 73: Oscon2015 150724001540-lva1-app6891

In Summary

Page 74: Oscon2015 150724001540-lva1-app6891

We Can Learn Before and After Failure

Page 75: Oscon2015 150724001540-lva1-app6891

Before

• Architecture reviews for new technology

• Operability reviews to gain launch confidence

Page 76: Oscon2015 150724001540-lva1-app6891

After

• Postmortems are done soon after a failure

• avoid human error, counterfactuals, hindsight bias, and root cause

Page 77: Oscon2015 150724001540-lva1-app6891

Questions?

John Goulah (@johngoulah)

Etsy