Velocity - cloudy with a chance of scaling
Transcript of Velocity - cloudy with a chance of scaling
Cloudy with a Chance of Scaling
A Guide to High Availability in the Cloud
Lee Atchison, Principal Cloud Architect and Advocate at New Relic, Inc.
©2008-16 New Relic, Inc. All rights reserved.
Safe Harbor
This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.
Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import.
Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at http://ir.newrelic.com or the SEC’s website at www.sec.gov.
New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.
Who am I?
Lee Atchison
Principal Cloud Architect and Advocate
Specialize in:
Cloud computing
Services & Microservices
Scalability, Availability
29 years in industry
7 in Amazon Retail & AWS (Built SW/VG AppStore, AWS Elastic Beanstalk)
4 in New Relic (Architecture Lead, Cloud, Service Migration)
@leeatchison leeatchison
I want to tell you a story…
You tell me if this is ok or not…
This was a recently overheard conversation…
Is this ok?
“We were wondering how changing a setting on our MySQL database might impact our performance…
… but we were worried that the change may cause our production database to fail…
… Since we didn’t want to bring down production, we decided to make the change to our backup (replica, hot standby) database instead…
… After all, it wasn’t being used for anything at the moment.”
Until, of course, the backup was needed…
This was a true story
I fly radio-controlled model airplanes
There’s an old adage:
“Keep your plane at least two mistakes high.”
But Why?
Why Two Mistakes High?
You perform some stunt, and it fails… You lose altitude
Now, you are lower, and you are trying to recover
You want to still be high enough, so that if you make another mistake, you won’t crash
You always want to be high enough to make a mistake,
even if you’ve just made a mistake…
Put another way…
…flying two mistakes high, you can always have a backup plan for recovering from a mistake
… even if you are currently recovering from a mistake
Don’t screw up...
…while you are screwing up
The same applies when building highly available, high-scale applications
How do we keep “Two Mistakes High” in an application?
Walk through ramifications and recovery plan
Make sure recovery plan works
Has no mistakes
Has its own recovery plan
If recovery plan doesn’t work… it’s not a good recovery plan
EXAMPLE: How many nodes do we need?
How many nodes do I need to handle my traffic demands?
Building a service designed to handle 1,000 req/sec
(assume single node = 300 req/sec)
ceil[1,000 / 300] = 4 nodes
With four nodes, we can handle our traffic PLUS we have enough nodes that we can lose one! We have redundancy!
Right???
EXAMPLE: Well no…
You think 4 nodes gives you redundancy, but it doesn’t...
If you lose one of those nodes:
Remaining nodes can only handle 300 * 3 = 900 req/sec
Cannot handle the 1,000 req/sec load
EXAMPLE: How many do we need?
4 nodes... allows handling our traffic, but we cannot handle a node failure
5 nodes... allows handling a single node failure
But… No upgrading
6 nodes... allows handling a multi-node failure
Or… Handle a failure during an upgrade
or more…
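The node arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: `nodes_needed` and `surviving_capacity` are hypothetical names, and the sketch assumes every node has the same capacity and that load spreads evenly across the nodes that remain.

```python
def nodes_needed(demand_rps: int, node_rps: int, losses_tolerated: int = 0) -> int:
    """Nodes required to carry demand_rps with losses_tolerated nodes out of service."""
    base = (demand_rps + node_rps - 1) // node_rps  # integer ceil(demand / capacity)
    return base + losses_tolerated


def surviving_capacity(nodes: int, node_rps: int, nodes_lost: int) -> int:
    """Requests/sec the fleet can still serve after nodes_lost nodes are gone."""
    return max(nodes - nodes_lost, 0) * node_rps


# Figures from the slides: 1,000 req/sec demand, 300 req/sec per node.
for tolerated in (0, 1, 2):
    n = nodes_needed(1000, 300, tolerated)
    print(f"tolerate {tolerated} lost node(s): {n} nodes")
```

With the slide's numbers this gives 4, 5, and 6 nodes. A "lost" node here can be a failure or a node taken out for an upgrade, which is why tolerating a failure during an upgrade counts as two losses.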
LESSON: Fly Two Mistakes High
Even if you think you have redundancy…
Think through the failure modes…
… and make sure
EXAMPLE: Rolling Deploys
What is a Rolling Deploy?
[Diagram: a load balancer in front of five servers]
Remove one server from service
Deploy new application version to this server
Put back into service
Repeat 1 by 1 with remaining servers
Allows deploying changes to your servers without bringing your entire application down
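The rolling deploy steps can be sketched as a loop over an in-memory fleet. This is only a simulation under stated assumptions: the `rolling_deploy` function and the server names are hypothetical, and a real deploy would call a load balancer and a deployment tool where the comments indicate.

```python
def rolling_deploy(servers: dict[str, str], new_version: str) -> int:
    """Upgrade servers one at a time; return the fewest servers in service at any point."""
    in_service = set(servers)
    min_in_service = len(in_service)
    for name in list(servers):
        in_service.remove(name)        # remove one server from service
        min_in_service = min(min_in_service, len(in_service))
        servers[name] = new_version    # deploy the new version to this server
        in_service.add(name)           # put it back into service
    return min_in_service


fleet = {f"server-{i}": "v1" for i in range(1, 6)}
lowest = rolling_deploy(fleet, "v2")
print(lowest, set(fleet.values()))
```

For five servers the fleet never drops below four in service, and every server ends up on the new version.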
EXAMPLE: Rolling Deploys
Are you safe?
You need 10 nodes to run your application
You have 11 nodes, so that you can do a rolling deploy
Bring one node down at a time to upgrade…
Always at least 10 available...
EXAMPLE: Well no…
What if that node fails during upgrade?
What if you now have to roll back?
With the failed server to contend with… you have no room to do an upgrade or rollback, and you are at risk for another failure
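The headroom problem in this example comes down to simple arithmetic, sketched below. `spare_nodes` is a hypothetical helper, and the 12-node case is an added illustration, not a figure from the talk.

```python
def spare_nodes(total: int, required: int, upgrading: int = 0, failed: int = 0) -> int:
    """Nodes available beyond the required minimum; negative means under capacity."""
    return total - upgrading - failed - required


# Figures from the slides: 10 nodes required, 11 provisioned.
steady_deploy = spare_nodes(11, 10, upgrading=1)               # normal rolling deploy
failure_mid_deploy = spare_nodes(11, 10, upgrading=1, failed=1)  # a node fails mid-deploy
one_more_node = spare_nodes(12, 10, upgrading=1, failed=1)     # assumed extra node
print(steady_deploy, failure_mid_deploy, one_more_node)
```

With 11 nodes the deploy has exactly zero spare nodes, so a single failure during the deploy leaves the application one node short. One more node restores the margin for that case.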
LESSON: Fly Two Mistakes High
Make sure you can handle failures
Even during “exceptional” events, such as upgrades
Exceptional events can cause failures
EXAMPLE: Unknown dependencies
Are you safe?
You have your application running on 20 servers…
You can run on 15 servers if necessary
Plenty of redundancy
EXAMPLE: Well, depends…
Are any of the 20 servers in the same rack?
Share the same power supply?
Share the same power source?
Share the same A/C system?
The Cloud is not immune!
LESSON: Fly Two Mistakes High
Redundancy is not redundancy when the resources are not independent
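One way to test for independence is to count how many servers sit behind each shared dependency and look at the worst case. The sketch below is illustrative only: `worst_single_domain_loss` and the rack and power-feed layout are assumptions, not details from the talk. It expects a non-empty server list.

```python
from collections import Counter


def worst_single_domain_loss(servers: list[dict[str, str]]) -> tuple[str, str, int]:
    """Return (domain kind, domain id, server count) for the largest shared dependency."""
    counts = Counter(
        (kind, ident) for server in servers for kind, ident in server.items()
    )
    (kind, ident), lost = counts.most_common(1)[0]
    return kind, ident, lost


# Hypothetical layout: 20 servers in 4 racks of 5, but only 2 power feeds.
fleet = [
    {"rack": f"rack-{i // 5}", "power": f"feed-{i // 10}"}
    for i in range(20)
]
kind, ident, lost = worst_single_domain_loss(fleet)
remaining = len(fleet) - lost
print(kind, lost, remaining >= 15)
```

In this assumed layout a single power feed takes out 10 servers, leaving 10, which is below the 15 the application needs even though 20 servers looked like plenty.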
EXAMPLE: Failure loop
Are you safe from power outages?
You live in an apartment…
The apartment provides an enclosed garage to store things in
The power goes out in your place a lot…
... you buy a generator, store it in the garage
Oops… the garage:
Has a single door, the big garage door
It has a garage door opener
That requires electricity to open...
The generator is only available... when you already have power…
LESSON: Fly Two Mistakes High
Make sure your recovery plans actually are operational when you are in a failure mode
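A failure loop like the generator story can be caught by walking what the recovery plan depends on and checking whether the failure itself is in that chain. A minimal sketch, assuming dependencies can be written down as a simple map; `plan_survives` and the dependency names are hypothetical.

```python
def plan_survives(failure: str, plan: str, depends_on: dict[str, list[str]]) -> bool:
    """True if `plan` does not depend, directly or transitively, on `failure`."""
    seen: set[str] = set()
    stack = [plan]
    while stack:
        item = stack.pop()
        if item == failure:
            return False
        if item in seen:
            continue
        seen.add(item)
        stack.extend(depends_on.get(item, []))
    return True


# The generator story as a dependency map (names are illustrative).
deps = {
    "generator": ["garage door"],
    "garage door": ["garage door opener"],
    "garage door opener": ["mains power"],
}
generator_ok = plan_survives("mains power", "generator", deps)
print(generator_ok)
```

Here the check returns `False`: the generator depends on the garage door, which depends on the opener, which depends on the very power it is meant to replace.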
EXAMPLE: High redundancy in action
A real system… Great example:
Highly independent
Multi-level error recovery
Highly recoverable system
Redundant
In fact, one of the very first large scale software applications utilizing extreme redundancy and failure management
EXAMPLE: What is this system?
US Space Shuttle Program
They had problems… serious mechanical problems...
But the software system utilized state of the art:
• Redundancy techniques
• Error recovery techniques
EXAMPLE: US Space Shuttle System
Five onboard computers
Four were identical (more on the fifth later)
All four:
– Ran the exact same program during critical periods
– Given same data
– Expected to generate the same result
EXAMPLE: Four computers
Computers voted on the proper outcome
If any one computer did not generate the same results:
Those that disagreed with the outcome were turned off for remainder of the flight
Ultimate in democratic systems…
EXAMPLE: Four computers
Could FLY with only THREE computers working
Could LAND with only TWO computers working
EXAMPLE: Deadlock
What if the four computers couldn’t decide? (software bug or multiple failures)
Fifth computer was used as a tie breaker
Much simpler version of software… only used for key decisions
Software written by independent software team, unconnected with rest of software developers
(In theory) would not introduce same software errors…
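The voting idea described on these slides can be illustrated with a small majority-vote function. This is a generic sketch of redundant voting with an independent fallback, not the Shuttle's actual flight software logic; `vote`, the computer names, and the sample values are all hypothetical.

```python
from collections import Counter
from collections.abc import Hashable


def vote(results: dict[str, Hashable], backup: Hashable) -> tuple[Hashable, set[str]]:
    """Return (chosen result, computers to switch off) for one round of voting."""
    tally = Counter(results.values())
    winner, count = tally.most_common(1)[0]
    if count * 2 > len(results):       # a strict majority agrees
        dissenters = {name for name, r in results.items() if r != winner}
        return winner, dissenters
    return backup, set()               # deadlock: defer to the independent backup


clear_majority = vote({"c1": 42, "c2": 42, "c3": 42, "c4": 41}, backup=42)
deadlock = vote({"c1": 42, "c2": 42, "c3": 41, "c4": 41}, backup=42)
print(clear_majority, deadlock)
```

In the first call three computers agree, so the fourth is marked for shutdown. In the second the vote splits two against two, so the sketch falls back to the independently produced backup value and switches nothing off.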
Highly Successful
30-year operation of Space Shuttle:
Never a case where a serious life-threatening problem occurred that was a result of a software problem
Even though software was the most complex software ever built for a space program
US Space Shuttle
This is extreme (not needed by most projects)
Shows what is possible...
Independence is critical to high availability
LESSON: Fly Two Mistakes High
Use availability solution consistent with the risk
Higher the risk, higher the focus on availability
Don’t over invest, don’t under invest
But think ahead, avoid the surprise
And remember…
“Keep your plane at least two mistakes high.”
Want to Learn More?
Architecting for Scale
By: Lee Atchison
Published by: O’Reilly Media, Available: June 2016
www.architectingforscale.com
Preview edition available at New Relic booth
Velocity Events
“Static vs Dynamic Cloud”: Thursday 12 noon, New Relic Booth
Office Hours: Thursday 3pm, O’Reilly Booth
Book Signing: Today 2:30pm, O’Reilly Booth; throughout show, New Relic Booth
Thank you.
Lee Atchison
Principal Cloud Architect and Advocate at New Relic, Inc.
@leeatchison leeatchison