Design for Scale / Surge 2010

Post on 08-May-2015

description

Christopher Brown's surgecon2010 talk on resilient, scalable systems based on his work on Amazon's EC2 and the Opscode Platform.

Copyright © 2010 Opscode, Inc - All Rights Reserved

‣ cb@opscode.com ‣ @skeptomai ‣ www.opscode.com

Christopher Brown VP, Engineering

Design for Scale

Who am I?

• Amazon EC2

• Microsoft Edge Computing Network

• Opscode

Google, Amazon, Microsoft built their own tools

almost everyone else is here...

... inexperienced or poorly equipped for the world in which we now operate.

The Method

http://www.flickr.com/photos/wonderlane/2090966628/sizes/l/

Bootstrapping

Configuration

Command & Control

Nanite!

Got it?

Defining the cloud is like this...

Origin Myth of EC2

Dynamism

...not about excess capacity...

• Disintermediation: developers can freely experiment

• Isolation: applications safely co-exist

• Utilization: best use of expensive resources

This is what you are paying for

Scale

You are not this BIG

You are not that BIG

• LAMP can scale on generic architecture

• 2008 - Facebook has over 800 memcached servers, with 28 terabytes of RAM

• 2010 - Github has 16 physical machines, 128 cores, 288 GB RAM

• Don’t design for A Million Users

• Ship early, Ship ugly, Ship often!

EC2 Design Principles

• Minimize management footprint

• Run in VMs just like customers.

• Forced to analyze what must run in privileged space

• “Harden everything” means separate network traffic inside the datacenter – customers and management run there

• True multi-tenancy - Customers run side-by-side

• Design by Fight Club

• “You are not a beautiful and unique snowflake”

• “On a large enough time line, the survival rate for everyone will drop to zero.”

http://www.flickr.com/photos/europedistrict/4058066840/

Primitives

• Simple API, single unit of work

• think of early Unix tools (MH)

• Can compose with other APIs

• Does not define policy / coupling

• Customers will surprise you

APIs, Mashups

http://www.flickr.com/photos/jfseesthings/4293062294/sizes/l/

Simplify

• Move complexity “up the stack”

• Easier to debug

• “Simple and Open” wins

• OAuth, OpenID

• ATOM, REST

• Example: EC2 Metadata - HTTP
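The EC2 metadata example is easy to sketch, because the service answers plain HTTP GETs on a link-local address and any HTTP client can read it. The helper names `metadata_url` and `fetch_metadata` are illustrative assumptions, not part of any SDK. Instances configured to require session tokens (IMDSv2) need an extra token request that this sketch leaves out.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def metadata_url(key: str) -> str:
    """Build the URL for one metadata key, for example "instance-id"."""
    return METADATA_BASE + key.lstrip("/")


def fetch_metadata(key: str, timeout: float = 2.0) -> str:
    """Fetch one value with a plain HTTP GET.

    This only succeeds when run from inside an EC2 instance.
    """
    with urllib.request.urlopen(metadata_url(key), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

No client library, no SOAP envelope, no credentials: a shell one-liner with an HTTP client does the same job, which is the point of keeping the interface simple and open.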

Cost

What are you willing to pay for?

• CapEx versus OpEx

• The Cloud is not “Cheaper”

• Do you have money, time, or experience?

Power

Nobody ever imagined a band of Orcs would steal a database table

Charles Stross - Halting State

MTTF & MTTR

Understanding how, when and why things fail is great ... but

If your Mean Time to Recover exceeds the time value of your data, your business is DEAD

http://www.flickr.com/photos/dierken/948171048/sizes/z/
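A rough way to put numbers on the MTTF and MTTR point is the standard steady-state availability formula, MTTF / (MTTF + MTTR). The function names and the "data value window" parameter below are assumptions made for illustration; the talk does not define them.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def recovery_is_too_slow(mttr_hours: float, data_value_window_hours: float) -> bool:
    """True when recovery takes longer than the data stays valuable."""
    return mttr_hours > data_value_window_hours
```

With the same 1000-hour MTTF, cutting MTTR from 10 hours to 1 hour moves availability from roughly 99.0% to roughly 99.9%, so recovery time is often the cheaper lever to pull.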

Testing

• Test with production-like dataset and performance

• Don’t do “Design by Laptop”

• A/B Testing

• API versioning

Pull the Plug

• Create test environment

• Pull the plug

• Document

• Pull the plug again!

http://www.flickr.com/photos/rosipaw/5033284534/sizes/m/in/photostream/

Theo vs Morpheus

You are not Theo

You’re probably not Morpheus either

Free your mind...

• Vertical vs Horizontal Scale

• Availability

• Reliability

• 99% vs 99.x% per unit?
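The "99% vs 99.x% per unit?" question can be made concrete with a little arithmetic. The sketch below assumes unit failures are independent, which real systems often violate (shared power, shared switches, shared bugs), so treat the results as an upper bound. The function names are illustrative.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of redundant replicas where any one of them suffices.

    Assumes independent failures.
    """
    return 1.0 - (1.0 - unit_availability) ** units


def serial_availability(unit_availability: float, units: int) -> float:
    """Availability of a chain in which every unit must be up."""
    return unit_availability ** units
```

Two independent 99% units in parallel give about 99.99%, while ten 99.9% units in series give only about 99.0%. Cheap units scaled horizontally can beat one expensive unit, but only if the failure domains really are separate.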

Availability

• For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.

• “Read globally, Write locally” with inconsistent cache

• Service Level Agreements, even (especially?) internally

Think Globally, Act Locally

• Global but inconsistent aggregate view

• Local action where data is authoritative

• Autonomy

• “Rightsizing” your failure domain

http://www.flickr.com/photos/28634332@N05/3872137437/sizes/m/in/photostream/

Distributed Systems Design

• Avoid execution caching

• “Don’t lie, don’t retry”

• Embrace failure

• Don’t block the client

• Avoid internal policy

• Ensure the system makes forward progress

Embrace Failure

• It’s OK to apologize

• It’s better to completely fail for some users than penalize all of them

• The Web is all about “Hit Refresh”

Apologize...to Pat Helland

Admission Control

• Distributed Throttling

• Staged / Pipeline with back pressure

• Measure scalability at each stage

• Degraded performance

• Make progress for admitted requests

• At odds with “stateless” / session-less

http://www.flickr.com/photos/jayneandd/4450623309/sizes/m/in/photostream/
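One minimal way to sketch admission control for a single stage is a cap on in-flight requests: admitted work always gets a slot and makes progress, and everything over the cap is rejected immediately so the caller feels back pressure. The class and method names are hypothetical, and a real distributed throttle would need shared state that this single-process sketch does not attempt.

```python
import threading


class AdmissionController:
    """Admit at most `max_in_flight` requests; reject the rest immediately."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and reserve a slot, or False if the stage is full."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # shed load instead of queueing without bound
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot once an admitted request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(controller: AdmissionController, work) -> str:
    """Run `work` if admitted; otherwise tell the caller to back off."""
    if not controller.try_admit():
        return "rejected"
    try:
        work()
        return "done"
    finally:
        controller.release()
```

The counter is exactly the kind of state that sits at odds with a purely stateless front end: something has to remember how much work has already been admitted.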

Make Forward Progress

• MVCC, vector clocks, & reconciliation

• Don’t resurrect objects

• always go forward, never go back

• "name" is a property of an object, not its unique key

• Break the link, garbage collect later

• Model “degraded service” performance
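Vector clocks are one of the tools named above for deciding whether one version of an object supersedes another or whether the two need reconciling. The following is a minimal sketch using plain dictionaries keyed by node name; the function names are illustrative, and production systems add pruning and persistence that are omitted here.

```python
from typing import Dict

Clock = Dict[str, int]


def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out


def merge(a: Clock, b: Clock) -> Clock:
    """Element-wise maximum: the smallest clock that has seen both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def descends(a: Clock, b: Clock) -> bool:
    """True if `a` has seen everything `b` has (a >= b on every node)."""
    return all(a.get(n, 0) >= count for n, count in b.items())


def concurrent(a: Clock, b: Clock) -> bool:
    """Neither clock descends from the other, so the writes need reconciling."""
    return not descends(a, b) and not descends(b, a)
```

Counters only ever increase and `merge` never discards history, which matches the "always go forward, never go back" rule: reconciliation produces a new version rather than reviving an old one.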

Request Signing

• Stateless - no session tracking to lose or to purge later

• X509 - only public information on front-end boxes. More secure against exploit

• Shared secret - faster, smaller signature but requires secret info close to request front-end
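The shared-secret variant can be sketched with an HMAC over the parts of the request. The canonical string layout below (method, path, timestamp, body digest) is an assumption made for illustration; it is not the format used by AWS or by the Opscode Platform, and a real verifier would also reject stale timestamps.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str,
                 timestamp: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the parts of the request."""
    body_digest = hashlib.sha256(body).hexdigest()
    message = "\n".join([method.upper(), path, timestamp, body_digest])
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str,
                   timestamp: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_request(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Every request carries its own proof, so the server keeps no session to lose or purge. The trade-off from the slide is visible in `verify_request`: the front end must hold `secret`, whereas an X509 scheme would need only the public key there.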

Measure, Monitor, Respond

• Save *everything* *forever*

• Histograms / Pareto Chart

• tp99.9, tp99, and tp90

• ignore tp50, “average”

• http://en.wikipedia.org/wiki/Control_chart

• http://www.newrelic.com/

• http://www.splunk.com/

• skewness, kurtosis
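The tp90, tp99 and tp99.9 figures are just high percentiles of the latency distribution. A minimal sketch using the nearest-rank definition follows; other percentile definitions interpolate between samples and give slightly different answers, and the function name `tp` is only a label borrowed from the slide.

```python
import math
from typing import Sequence


def tp(samples: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    `percentile` percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    # Round before ceil so float noise cannot push the rank up by one.
    rank = math.ceil(round(percentile * len(ordered) / 100.0, 9))
    return ordered[max(rank, 1) - 1]
```

A handful of very slow requests barely moves the median or the mean, but it shows up immediately in `tp(latencies, 99.9)`, which is why the tail is the number to watch.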

Control Chart

• Day over Day

• Same Day, Year over Year

• Confidence Intervals

“Shewhart stressed that bringing a production process into a state of statistical control, where there is only common-cause variation, and keeping it in control, is necessary to predict future output and to manage a process economically.”

• http://en.wikipedia.org/wiki/Control_chart
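A control chart boils down to a center line and limits derived from a baseline period, with new points flagged when they fall outside the limits. The sketch below uses the sample standard deviation and three-sigma limits as a simplification; a classic Shewhart individuals chart estimates sigma from moving ranges instead. The function names are illustrative.

```python
import statistics
from typing import List, Sequence, Tuple


def control_limits(baseline: Sequence[float],
                   sigmas: float = 3.0) -> Tuple[float, float, float]:
    """Return (lower limit, center line, upper limit) from a baseline period."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline: Sequence[float],
                   observed: Sequence[float]) -> List[int]:
    """Indexes of observed points that fall outside the baseline's limits."""
    lower, _, upper = control_limits(baseline)
    return [i for i, value in enumerate(observed)
            if value < lower or value > upper]
```

Choosing the baseline is where "Day over Day" and "Same Day, Year over Year" come in: comparing today against the same weekday last week or last year keeps normal periodicity from being flagged as a fault.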

Characteristic Curves

Periodicity

SLA, Variance, Troubleshooting

Data Taxonomy

• Precious

• Cachable

• Expensive

• Cheap

Consistency

• Authoritative vs. Consultative

• is_authorized? vs list group

Performance

• Call length

• Cyclomatic Complexity

• Request ID flow

• Vertical vs Horizontal Scale

• tension between unit performance and scalability

Failure Domains

• EC2 “droplets”

• EC2 DNS

• Coordinator zones

Still with me?

Successes

• Sharable “AMI”s

• Metadata (Simple and open again)

• Open API (think Eucalyptus)

• No API throttling

• Primitives

• Pay-as-you-go

• Free traffic between S3 and EC2

• Data and Compute together

Failures

• SOAP makes little girls cry

• Amazon Web Services, circa 2006, was > 75% REST or Query

• SOAP well supported by commercial vendors, with their libraries

• Still *Way* too hard to use.

• Commodity business. Driving the bottom out of cost causes quality to suffer.

• API vs UI?, User Experience in general

• IaaS (Infrastructure as a Service) is insufficient by itself

• a hangman's noose. EC2, and the other offerings,

Where are we going?