Velocity London 2012: BBC Olympics


The BBC’s Experience of preparing for the 2012 London Olympics

For Velocity London 2012

Andy “Bob” Brockhurst
Principal Engineer, BBC Platforms/Frameworks

Introduction

• The Team
• LAMP (without the M)
• Tomcat Java service layer
• Custom Apache modules
• Varnish with extensions
• Zend Framework…
  – …customised, a.k.a. PAL

• Barlesque

How the BBC works

• One domain[1]

• Two technology stacks[2]

• Certs and SSL
• ProxyPasses
• Apps are a TLD
• 360+ apps
• Everyone shares everything[3]

[1] Okay, there are several, but they are all really the same one.
[2] Okay, three if you are going to be picky.
[3] Yes, really, everything!

Network Topology

• Dual DC[1]

• No DC affinity[2]

[1] One more soon(ish)
[2] Well, a couple of apps do[3]
[3] We don’t talk about them


Traffic Routing

• TM -> PAL
• TM -> Varnish -> TM -> PAL
• TM -> Service Layer
• TM -> Varnish -> TM -> Service Layer

Traffic Routing

[Diagram: each class of request hits a Traffic Manager (TM). iPlayer goes TM -> Varnish -> TM -> PAL; everything else goes TM -> PAL; Sport goes TM -> Varnish -> TM -> Dynamite; api requests go TM -> Dynamite.]
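The four patterns amount to a path-prefix lookup. A minimal sketch, assuming a longest-prefix-wins rule and invented prefixes (this is illustrative, not the BBC's actual Traffic Manager configuration); "Dynamite" is the service layer named in the diagram:

```python
# Illustrative sketch of the routing patterns above; the path prefixes and
# longest-prefix-wins rule are assumptions, not real BBC TM config.
ROUTES = {
    "/iplayer": ["TM", "Varnish", "TM", "PAL"],       # cached PHP (PAL) app
    "/sport":   ["TM", "Varnish", "TM", "Dynamite"],  # cached service layer
    "/api":     ["TM", "Dynamite"],                   # straight to the service layer
    "/":        ["TM", "PAL"],                        # everything else
}

def route(path: str) -> list[str]:
    """Return the chain of hops for a request path (longest prefix wins)."""
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path.startswith(prefix):
            return ROUTES[prefix]
    return ROUTES["/"]

print(route("/sport/olympics/2012"))  # ['TM', 'Varnish', 'TM', 'Dynamite']
```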

Environments

• Integration
• Test
• Staging
• Live
• Journalism

Right let’s do some testing

Why?

• Too much change
  – Network architecture
  – Server configurations
  – Load balancers
  – Peering points
• High profile
• Gain confidence

Gaining Confidence

• Load testing on Stage
  – Tests individual applications
  – Single endpoints only
  – No concurrent load
• Real hardware
• Real data
  – As much as possible
• Real Journalism

Other objectives

• Maintain BAU
• Handle failure gracefully
• Deliver on expectations

“What the Abdication did for Radio, and the Coronation did for Television, London 2012 will do for Online.”

Current Volumetrics

• Big numbers for Sport
  – 9M users/day
  – 90M views/day
• Punishing peaks
  – Saturday football final scores: 4,000 pv/s
  – ~750k concurrent users
• Wimbledon
  – 1,700 pv/s
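For scale: 90M views spread across 86,400 seconds averages out at roughly 1,000 pv/s, so the Saturday 4,000 pv/s football peak runs at about four times the all-day mean.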

Expected Volumetrics

• Expected peaks
  – 1.5M concurrent users
  – 60k different sport pages
    • 2,500 per minute
  – 30% video via iPlayer

Timeline

• March 2012 (T minus 5 months)
  – Team members assigned
  – Resilience testing
  – Performance testing

• Testing with External Partner

Olympics Run-up

• Jubilee (2nd June)
• Euros (8th June)
• Wimbledon (25th June)
• Formula One

Cloud Testing

• International testing
• Detailed test results

Cloud Testing

• First performance test breaks Live
• Exposed monitoring issues
• Couldn’t internally diagnose
• Lots of tail, grep, awk, sed

Early Findings

• Stop tests
• Monitoring
• UK data centre capacity
• UK data centre network segments

(Not) Caching kills

• Conditional modules
• Non-Olympics-related modules
  – Commenting / Favourites
• Lowers cacheability (sketch below)
• Testing an immature product
• Subsequent testing exposed more
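A minimal sketch of why conditional, personalised modules hurt here: one per-user module on an otherwise anonymous page forces the whole response out of the shared cache. The function and module names are hypothetical, not PAL code:

```python
# Hypothetical sketch: one personalised module makes the whole page
# uncacheable at a shared cache such as Varnish.
def cache_headers(modules: list[dict]) -> dict:
    if any(m.get("personalised") for m in modules):
        # Shared caches must not store per-user markup.
        return {"Cache-Control": "private, no-store"}
    # Fully anonymous page: safe to cache at the edge.
    return {"Cache-Control": "public, max-age=60"}

page = [{"name": "medal-table"}, {"name": "favourites", "personalised": True}]
print(cache_headers(page))  # {'Cache-Control': 'private, no-store'}
```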

What is a failure?

• Error 500?
• Blank pages?
• Stale content?
• Slow pages?
• Burning data centres?

Resilience Testing

• Kill backends
• Traffic Manager
  – Screw with headers
  – Screw with status (418, anyone?)
  – Truncate body
• Introduce waits
• Limit cache sizes
• Reduce network bandwidth
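A rough sketch of this style of fault injection, in Python rather than at the Traffic Manager; the probabilities and fault mix are made up:

```python
import random
import time

# Illustrative fault injector (a sketch, not the actual Traffic Manager):
# applies one of the failure modes above, at random, to a backend response.
def inject_fault(status: int, headers: dict, body: bytes):
    roll = random.random()
    if roll < 0.1:
        status = 418                     # screw with status (418, anyone?)
    elif roll < 0.2:
        headers = {**headers, "Content-Type": "application/unknown"}  # screw with headers
    elif roll < 0.3:
        body = body[: len(body) // 2]    # truncate body
    elif roll < 0.4:
        time.sleep(5)                    # introduce waits
    return status, headers, body

print(inject_fault(200, {"Content-Type": "text/html"}, b"<html>ok</html>"))
```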

Early findings

• Failure mode testing
  – Everything is a SPOF
  – Performance sucks in a failure

Specific findings

• Monitoring thresholds
• Verbose logging, everywhere
• Timeouts
• No data
• Volumetrics
• Unfair load balancing

Verbose Logging

• Wrong levels configured
• Diagnostic information
• Expected/handled errors
• Too much detail
• Hurts health/forensic reporting
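A small sketch of the wrong-levels problem: expected, handled conditions logged at ERROR drown out real faults in health and forensic reports. Everything here (backend_get, the exception classes) is hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO)   # DEBUG noise stays out of live logs
log = logging.getLogger("app")

class NotFoundError(Exception): pass      # expected, handled condition
class BackendDown(Exception): pass        # genuine failure, must be visible

DEFAULT_PROFILE = {"name": "guest"}

def backend_get(path):                    # hypothetical stand-in backend call
    raise NotFoundError(path)

def fetch_profile(user_id):
    try:
        return backend_get(f"/profile/{user_id}")
    except NotFoundError:
        # Handled error: DEBUG, so it doesn't drown health/forensic reports.
        log.debug("no profile for %s, using default", user_id)
        return DEFAULT_PROFILE
    except BackendDown as e:
        # Genuine fault: ERROR, so operations staff are not blind to it.
        log.error("profile backend unavailable: %s", e)
        raise

print(fetch_profile(42))  # {'name': 'guest'}
```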

Not enough logging

• Fatals with no logging
• Unhandled conditions
• Monitoring holes
• Operations staff blind

Platform Configuration

• Unfair load balancing
  – Remove older commodity servers
• Competitive service applications
  – Re-home critical applications
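A toy illustration of the unfairness: round-robin hands an old commodity box the same share of traffic as a new one. Weighting by capacity (the server names and weights below are invented) is one alternative to removing the old servers outright:

```python
import random
from collections import Counter

# Capacity weights (invented): the old commodity box can take a quarter
# of the load of a new machine, so it gets a quarter of the traffic.
SERVERS = {"new-01": 4, "new-02": 4, "old-01": 1}

def pick_server() -> str:
    return random.choices(list(SERVERS), weights=SERVERS.values())[0]

print(Counter(pick_server() for _ in range(9000)))  # roughly a 4:4:1 split
```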

“Timeouts at lower levels in the architecture MUST be set shorter than the timeouts configured at higher levels of the architecture.”

Timeouts

• Frontend/backend timeouts
  – Frontends with lower timeouts
  – Caches never populated
• Alter backends to return early

More timeouts

• Unspecified timeouts
• Wrongly specified timeout units
  – ms vs. sec (sketch below)
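The MUST rule quoted earlier, plus the ms/sec confusion, can be checked mechanically. A sketch with invented layer names and values, normalising units first because mixed ms/sec settings were one of the actual findings:

```python
# Layers listed outermost first; values and names are invented.
LAYERS = [
    ("traffic-manager", 30,   "s"),
    ("varnish",         20,   "s"),
    ("pal-frontend",    10,   "s"),
    ("service-layer",   5000, "ms"),
]

def to_seconds(value: float, unit: str) -> float:
    return value / 1000 if unit == "ms" else value

def check(layers) -> None:
    previous = float("inf")
    for name, value, unit in layers:
        t = to_seconds(value, unit)
        # Lower layers MUST time out before the layer above gives up,
        # otherwise the upper cache never receives a response to store.
        assert t < previous, f"{name} ({t}s) outlives the layer above it"
        previous = t

check(LAYERS)  # passes: 30 > 20 > 10 > 5
```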

Poor Application Performance

• Multiple synchronous content requests

• International cacheability
• Missing negative caching
  – Bypassed shared caches
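A sketch of both fixes together: issue the page's content requests concurrently, so total latency becomes the slowest call rather than the sum, and remember failures briefly (negative caching) so error responses also shield the backends. The fetcher and TTL below are hypothetical:

```python
import asyncio
import time

NEGATIVE_TTL = 5.0                      # invented: remember failures for 5s
_failures: dict[str, float] = {}

async def fetch(url: str) -> str:
    if time.monotonic() - _failures.get(url, -NEGATIVE_TTL) < NEGATIVE_TTL:
        return ""                       # recent failure: serve an empty module
    try:
        await asyncio.sleep(0.1)        # stand-in for the real HTTP call
        return f"<module src={url!r}>"
    except Exception:
        _failures[url] = time.monotonic()
        return ""

async def build_page(urls):
    # gather() runs all requests in parallel: total latency is the slowest
    # call, not the sum of all of them.
    return await asyncio.gather(*(fetch(u) for u in urls))

print(asyncio.run(build_page(["/medals", "/schedule", "/video"])))
```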

Testing frequency

• Every two weeks
• Every week
• Every other day

One week before…

The opening ceremony…

• 1st successful test on Live
  – With no errors at all

Performance Overview

• Did find problems
  – Weren’t found on Stage
• In all architecture layers
• Components believed to be “fine” were not
• Stage is not suitable for this level of testing
• Proposal for any future “high profile” event
• CDNs didn’t really get tested

Resilience Overview

• Teams never tested failure scenarios
• Assumed that services didn’t fail
• Inconsistent use of flagpoles
• Reliance on mod_cache stale-on-error

Other problems

• Running a “fake” Olympics
  – That is invisible to the public
  – Did consider publicising a test
• No A/B (bucket) testing capability
• Some tests affected BAU
• No real test of the HLS/HDS streaming
• Platform monitoring cycle

Other problems

• RCA (root cause analysis) complicated by shared platform
• Testing stopped by BAU/TX
• High reliance on key staff
  – Some tests suffered
• No CDN testing
  – At their request
  – Places unfair load on infrastructure
• Unable to simulate network congestion

Working with external tester

• Workflow testing differed
  – User journeys
  – Direct linking to hotspots
• Very responsive to altering tests
• Did add extra complexity

Did it work?

• YES
  – Found and fixed issues
  – Before they bit us
  – On production
  – With little impact on BAU

Recommendations

• Increase Stage capacity
• Intelligent load balancing
• Test NFRs in development
• Caching, caching and more caching
• Kill load tests quickly
• Improve internal load testing
• Profile frontends under load
• Better post-analysis tools

Some Statistics

[Charts: Daily Reach (M); Streaming Views (M), Wed 1st Aug; Unique Browsers (M)]

Thanks for listening

• Thanks to flickr users:
  – dgjones
    • Office Dalek, London, 14-10-06
    • http://www.flickr.com/photos/dgjones/284592369
  – b3cft
    • Bombe rebuild detail
    • http://www.flickr.com/photos/b3cft/3797123899
  – Karindalziel
    • Clouds
    • http://www.flickr.com/photos/nirak/644336486
  – Enjoy Surveillance
    • What are you looking at?
    • http://www.flickr.com/photos/enjoy-surveillance/34795807/
  – Solo
    • 45th Annual Watsonville Fly-in and Air Show
    • http://www.flickr.com/photos/donsolo/4959045491/in/photostream/
  – SF Brit
    • Sunset over Iguazu
    • http://www.flickr.com/photos/cnbattson/4333692253/
• Olympics photos: www.london2012.com
• Other photos: EpicWin, FailBlog, Haha-Business

Special Thanks to:
  – David Holroyd
    • Technical Architect, BBC Sport (Olympics)
  – Matt Clark
    • Senior Technical Architect, BBC Sport

Thanks for listening

• This presentation:
  – TBC

• Me:
  – Andy “Bob” Brockhurst
  – Twitter: b3cft (and pretty much anywhere online)
  – www.kingkludge.net