TestIstanbul 2015

Ensuring Performance in a Fast-Paced Environment
Martin Spier
Performance Engineering @ Netflix
@spiermar
[email protected]

Transcript of TestIstanbul 2015


Martin Spier

● Performance Engineer @ Netflix
● Previously @ Expedia and Dell
● Performance
  ○ Architecture, Tuning and Profiling
  ○ Testing and Frameworks
  ○ Tool Development
● Blog @ http://overloaded.io
● Twitter @spiermar

Agenda
● How Things Worked
  ○ Pass/Fail Testing, Manual
● How Netflix Works
  ○ Development Model, Freedom & Responsibility
● Rethinking Performance
  ○ Tools, Methodologies, Canary Analysis, Performance Test Framework, Public Cloud, Automated Analysis

The Early Days

The Dawn of a New Era

Manual → Automated

● World's leading Internet television network
● ⅓ of all traffic heading into American homes at peak hours
● > 50 million members
● > 40 countries
● > 1 billion hours of TV shows and movies per month
● 100s of different client devices

Freedom and Responsibility
● Culture deck* is TRUE
  ○ 11M+ views
● Minimal process
● Context over control
● Root access to everything
● No approvals required
● Only Senior Engineers

* http://www.slideshare.net/reed2001/culture-1798664

Independent Development Teams
● Highly aligned, loosely coupled
● Free to define release cycles
● Free to choose any methodology
● But it’s an agile environment
● And there is a “paved road”

Development Agility
● Continuous innovation cycle
● Shorter development cycles
● Continuous delivery
● Self-service deployments
● A/B Tests
● Failure cost close to zero
● Lower time to market
● Innovation > Risk

Cloud
● Amazon’s AWS
● Multi-region Active/Active
● Ephemeral Instances
● Auto-Scaling
● Netflix OSS (https://github.com/Netflix)

Performance Engineering
● Not a part of any development team
● Not a shared service
● Improves and maintains performance through consultation
● Provides self-service performance analysis utilities
● Disseminates performance best practices

What about Performance?

Not Just Another Checklist Item!

Auto-Scaling
● 5-6x Intraday
● Auto-Scaling Groups (ASGs)
● Reactive Auto-Scaling
● Predictive Auto-Scaling (Scryer)
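Reactive auto-scaling reduces to a simple control rule: watch a load metric and adjust the desired instance count when it leaves a target band. The thresholds, step sizes, and function below are illustrative assumptions for this sketch, not Netflix's actual policy or an AWS API:

```python
def desired_capacity(current, avg_cpu,
                     scale_up_at=60.0, scale_down_at=30.0,
                     min_size=2, max_size=100):
    """Reactive auto-scaling sketch: grow the ASG on high load,
    shrink it conservatively on low load, otherwise hold steady."""
    if avg_cpu > scale_up_at:
        # Grow by ~10% (at least one instance), capped at max_size.
        return min(current + max(1, current // 10), max_size)
    if avg_cpu < scale_down_at:
        # Scale down one instance at a time, never below min_size.
        return max(current - 1, min_size)
    return current
```

Predictive auto-scaling (Scryer) would instead forecast the 5-6x intraday swing from historical traffic and scale ahead of it, rather than reacting to the metric after the fact.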

Red/Black Pushes
● New builds are rolled out as new Auto-Scaling Groups (ASGs)
● Elastic Load Balancers (ELBs) control the traffic going to each ASG
● Fast and simple rollback if issues are found
● Canary Clusters are used to test builds before a full rollout
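The push mechanics above can be sketched as follows; the `Elb` class is a hypothetical stand-in for ELB traffic registration, not an AWS API:

```python
class Elb:
    """Hypothetical stand-in for an Elastic Load Balancer's ASG registration."""
    def __init__(self):
        self.attached = set()

    def attach(self, asg):
        self.attached.add(asg)

    def detach(self, asg):
        self.attached.discard(asg)


def red_black_push(elb, old_asg, new_asg, healthy):
    """Roll out a new build as a fresh ASG behind the same ELB; keep the
    old ASG around so rollback is just a traffic flip."""
    elb.attach(new_asg)            # new build starts taking traffic
    if healthy(new_asg):
        elb.detach(old_asg)        # promote; old ASG stays on standby
        return "promoted"
    elb.detach(new_asg)            # fast, simple rollback
    return "rolled-back"
```

The rollback path never touches the old ASG, which is what makes it fast: the known-good instances are still running and registered.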

Squeeze Tests

● Stress Test, with Production Load
● Steering Production Traffic
● Understand the Upper Limits of Capacity
● Adjust Auto-Scaling Policies
● Automated Squeeze Tests
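A squeeze test is essentially a loop that steers progressively more production traffic at a cluster until an SLO breaks; the step factor and the toy latency model used in the test below are assumptions for illustration:

```python
def squeeze_test(start_rps, latency_at, slo_ms=100.0, step=1.1):
    """Squeeze-test sketch: increase the request rate in 10% steps until
    the latency SLO would break; return the last healthy rate, i.e. the
    cluster's capacity upper limit (used to tune auto-scaling policies)."""
    rps = start_rps
    while latency_at(rps * step) <= slo_ms:
        rps *= step
    return rps
```

In production the `latency_at` measurement would come from steering real traffic and reading the monitoring system, not from a model.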

Simian Army
● Ensures cloud handles failures through regular testing
● The Monkeys
  ○ Chaos Monkey: Resiliency
  ○ Latency: Artificial Delays
  ○ Conformity: Best-practices
  ○ Janitor: Unused Instances
  ○ Doctor: Health checks
  ○ Security: Security Violations
  ○ Chaos Gorilla: AZ Failure
  ○ Chaos Kong: Region Failure
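Chaos Monkey's core behavior, terminating one randomly chosen instance per run, can be sketched in a few lines; the `terminate` callback is hypothetical, standing in for the real cloud termination call:

```python
import random


def chaos_monkey(instances, terminate, rng=random):
    """Chaos Monkey sketch: kill one random instance so the service
    must prove, continuously, that it survives instance failure."""
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)
    return victim
```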

Canary Release

“Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.”

Automatic Canary Analysis (ACA)

Exactly what the name implies. An automated way of analyzing a canary release.

ACA: Use Case
● You are a service owner and have finished implementing a new feature in your application.
● You want to determine if the new build, v1.1, is performing analogously to the existing build.
● The new build is deployed automatically to a canary cluster.
● A small percentage of production traffic is steered to the canary cluster.
● After a short period of time, canary analysis is triggered.

Automated Canary Analysis
● For a given set of metrics, ACA will compare samples from control and canary;
● Determine if they are analogous;
● Identify any metrics that deviate from the baseline;
● And generate a score that indicates the overall similarity of the canary.
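A minimal sketch of that comparison, assuming mean-based similarity per metric with a fixed relative tolerance (the real ACA algorithm is more sophisticated than this):

```python
def canary_score(control, canary, tolerance=0.2):
    """ACA sketch: compare canary metric samples against the control
    baseline.  A metric deviates when its canary mean differs from the
    control mean by more than `tolerance` (relative).  The score is the
    fraction of metrics that look analogous."""
    deviations = []
    for metric, baseline in control.items():
        base = sum(baseline) / len(baseline)
        cand = sum(canary[metric]) / len(canary[metric])
        # A zero baseline is skipped rather than divided by.
        if base and abs(cand - base) / base > tolerance:
            deviations.append(metric)
    score = 1 - len(deviations) / len(control)
    return score, deviations
```

A Go/No-Go decision can then be a simple cutoff on the score, e.g. promote only when the score clears a configured bar.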

Automated Canary Analysis
● The score will be associated with a Go/No-Go decision;
● And the new build will be rolled out (or not) to the rest of the production environment.

● No workload definitions
● No synthetic load
● No environment issues

When is it appropriate?

What about pre-production Performance Testing?

Not always!

Sometimes it doesn't make sense to run performance tests.

Remember the short release cycles?

With the short time span between production builds, pre-production tests don’t warn us much sooner.

(And there’s ACA)

When it brings value. Not just because it is part of a process.

So when?

When? Use Cases

● New Services
● Initial Cluster Sizing
● Large Code Refactoring
● Architecture Changes
● Workload Changes
● Proof of Concept
● Instance Type Migration

Use Cases, cont.

● Troubleshooting
● Tuning
● Teams that release less frequently
  ○ Intermediary Builds
● Base Components (Paved Road)
  ○ Amazon Machine Images (AMIs)
  ○ Platform
  ○ Common Libraries

Who?

● Push “tests” to development teams
● Development understands the product; they developed it
● Performance Engineering knows the tools and techniques (so we help!)
● Easier to scale the effort!

How? Environment

● Free to create any environment configuration
● Integration stack
● Full production-like or scaled-down environment
● Hybrid model
  ○ Performance + integration stack
● Production testing

How? Monitoring

● Commercial tools did not work for us
● We developed our own tools
● Open source
● Atlas and Vector

How? Test Framework

● Built around JMeter

How? Test Framework

● Runs on Amazon’s EC2
● Leverages Jenkins for orchestration

How? Analysis

● In-house developed web analysis tool and API
● Results persisted on Amazon’s S3 and RDS

How? Analysis

● Automated analysis built-in (thresholds)
● Customized alerts
● Interface with monitoring tools
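Threshold-based automated analysis can be as simple as comparing each result metric against a configured limit and emitting alerts for the breaches; the metric names below are made up for illustration:

```python
def check_thresholds(results, thresholds):
    """Automated-analysis sketch: flag every result metric that
    crosses its configured limit, yielding alert messages."""
    return [
        f"{metric}={results[metric]} exceeds limit {limit}"
        for metric, limit in thresholds.items()
        if metric in results and results[metric] > limit
    ]
```

Hooking this into the test pipeline is what turns a manual pass/fail review into an automated one: the run fails, and alerts fire, without a human reading graphs.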

Manual vs. Automated Analysis

Learn and Understand vs. Pass and Fail

Takeaways

● Canary analysis
● Testing only when it brings VALUE
● Leveraging cloud for tests
● Automated test analysis
● Pushing execution to development teams
● Open source tools

Martin Spier
[email protected]
@spiermar
http://overloaded.io/

SHIP IT