BigDoor's Jeff Malek Gluecon Presentation

@JPMALEK Retrospective from a startup built in the cloud : top 3 big lessons from the AWS outage on 04.21.2011 plus 4,369 other smaller ones 03/16/2022 1

Transcript of BigDoor's Jeff Malek Gluecon Presentation

@JPMALEK

04/14/2023 1

Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones


What a country : entrepreneurial resiliency


“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”

(true story)


Boom


good to be home!

Go Buffs


me: previous startup

teams in 3 countries

highly transactional system

MS tech: IIS/MS SQL Server

co-located, leased/owned hardware

0% in cloud

$75M/yearly rev


me: current startup

systems 100% on AWS

99% free/open-source software

standing on the shoulders of giants


fault tolerance: 3 to 47 important failearnings

and 4,369 less important ones


in the context of our startup, of course

YMMV depending on velocity


Ruger


The Ruger Fault Equivalency

time = money

fault tolerance = time² - risk tolerance

Also known as:

'Fast, good and cheap: pick two'


system design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly


So how do you achieve the right level of fault tolerance in the cloud?

3 tenets


Tenet #1: Scripted Repeatability

Tenet #2: SPOF Elimination

Tenet #3: Clear-Cut Communication


who here has used AWS?


Tenet #1: prepare a fault-tolerant foundation with scripted repeatability

aka automation


from the start: script the non-interactive install of your tools and OS

custom AMI

Debian: great package management

based on Eric Hammond’s work: http://alestic.com/
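
A minimal sketch of what a scripted, non-interactive install step might look like on Debian. It is an illustration under assumptions, not the scripts used at BigDoor: the function names and the example package names are hypothetical. It relies only on apt-get's `--yes` and `--no-install-recommends` options and the `DEBIAN_FRONTEND=noninteractive` environment variable, and defaults to a dry run so it can be inspected without touching the system.

```python
import os
import subprocess


def build_install_command(packages):
    """Return an apt-get command that installs packages without prompting."""
    if not packages:
        raise ValueError("at least one package is required")
    # Sort and de-duplicate so repeated runs produce the same command.
    return ["apt-get", "install", "--yes", "--no-install-recommends",
            *sorted(set(packages))]


def noninteractive_env(base_env=None):
    """Copy the environment and tell debconf never to ask questions."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def install(packages, dry_run=True):
    """Build the command; only execute it when dry_run is False."""
    cmd = build_install_command(packages)
    if not dry_run:
        subprocess.run(cmd, env=noninteractive_env(), check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical package list, printed rather than installed.
    print(" ".join(install(["nginx", "curl"])))
```

Keeping the command builder separate from the code that runs it means the same list can be reviewed, logged, or baked into an image build.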


which will allow you to script the setup/tear-down of your stack
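
One way to picture scripted setup and tear-down is a context manager that launches every role in the stack and always terminates what it launched, even if something fails part-way. This is a sketch only: `FakeCloud` is a hypothetical in-memory stand-in for a real cloud API, and the role names are made up for illustration.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical stand-in for a cloud API; tracks running instances."""

    def __init__(self):
        self.running = {}
        self._next_id = 0

    def launch(self, role):
        self._next_id += 1
        instance_id = f"i-{self._next_id:04d}"
        self.running[instance_id] = role
        return instance_id

    def terminate(self, instance_id):
        self.running.pop(instance_id, None)


@contextmanager
def stack(cloud, roles):
    """Launch one instance per role, then tear everything down on exit."""
    launched = []
    try:
        for role in roles:
            launched.append(cloud.launch(role))
        yield launched
    finally:
        # Tear down in reverse order, including after a partial launch.
        for instance_id in reversed(launched):
            cloud.terminate(instance_id)


if __name__ == "__main__":
    cloud = FakeCloud()
    with stack(cloud, ["load-balancer", "app", "db"]) as ids:
        print("running:", ids)
    print("left over:", cloud.running)
```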


which will allow you to script system tests

integrity (3-4K tests)

performance (30-40K tests)

load, capacity (2-4M requests)
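
A scripted load test can be as small as the sketch below, which fires requests concurrently, counts failures, and collects latencies so that two runs (for example, before and after an upgrade) can be compared. The names `run_load` and `percentile` are hypothetical, and `handler` stands in for whatever issues a real request against the stack.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(handler, requests, workers=8):
    """Call handler once per request concurrently.

    Returns (failure_count, sorted_latencies_in_seconds). A request counts
    as failed if handler returns a falsy value or raises.
    """
    def timed(request):
        start = time.perf_counter()
        try:
            ok = bool(handler(request))
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed, requests))
    failures = sum(1 for ok, _ in results if not ok)
    latencies = sorted(latency for _, latency in results)
    return failures, latencies


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    # Stand-in handler: every tenth request "fails".
    failures, latencies = run_load(lambda n: n % 10 != 0, range(1000))
    print("failures:", failures)
    print("median latency (s):", percentile(latencies, 0.5))
```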


A/B system test results : MySQL Percona Upgrade


That’s how 1 person

set up and managed a network

comprised of 90+/- server instances for 1.5 years

while serving various other roles, without having to leave their chair

try that with real hardware


Tenet #2: SPOF Elimination

We don’t need no stinkin single points of failure.


SPOF Examples:

Cloud Provider

Region

Zone

Load Balancer

App Server

Database

Fred


Cloud Provider fail-over?

e.g. AWS –> Rackspace


Region fail-over?

e.g. us-east -> us-west within AWS

Nah.


Zone fail-over?

Yes.

[Diagram: US-WEST and US-EAST regions, each with zones A, B, C, D]


Zone fail-over best practices: are you using auto-scaling?

no: distribute server instances evenly between 2 or more zones

yes: trigger scaling on network I/O or custom metrics
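
For the case without auto-scaling, spreading instances evenly can be scripted with a simple round-robin assignment, sketched below. The function name, instance names, and zone names are illustrative assumptions; the point is only that no zone ends up with more than one instance above any other.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_names, zones):
    """Map each instance to a zone, round-robin, so zones stay balanced."""
    if len(zones) < 2:
        raise ValueError("need at least two zones to survive a zone outage")
    return dict(zip(instance_names, cycle(zones)))


if __name__ == "__main__":
    # Illustrative names only.
    placement = assign_zones(
        [f"app-{n}" for n in range(5)],
        ["us-east-1a", "us-east-1b"],
    )
    for name, zone in placement.items():
        print(name, "->", zone)
    print(Counter(placement.values()))
```

With five instances and two zones, losing either zone still leaves at least two instances serving traffic.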


Load-balancer (ELB), app server, database fail-over?

Yes.


So it’s actually all about reduction of the right SPOFs for your business context

Just adding the ability to fail-over and have backups within a region is huge!

Probably enough for most.

What about Fred?


Tenet #3: Clear-Cut Communication

transparency is soooo 2010


During an outage, communicating the right things at the right time:

hard.

But not that hard.


Three Tenets Revisited

Tenet #1: Scripted Repeatability

Tenet #2: SPOF Elimination

Tenet #3: Clear-Cut Communication


Notes