Velocity london 2012 bbc olympics

62
The BBC’s Experience of preparing for the 2012 London Olympics For Velocity London 2012 Andy “Bob” Brockhurst Principal Engineer BBC Platforms/Frameworks

description

Talk for Velocity London 2012

Transcript of Velocity london 2012 bbc olympics

Page 1: Velocity london 2012 bbc olympics

The BBC’s Experience of preparing for the 2012 London Olympics

For Velocity London 2012

Andy “Bob” BrockhurstPrincipal Engineer BBC Platforms/Frameworks

Page 2: Velocity london 2012 bbc olympics

Introduction

• The Team• LAMP (without the M)• Tomcat Java Service Layer• Custom apache modules• Varnish with extensions• Zend Framework..– ...customised a.k.a PAL

• Barlesque

Page 3: Velocity london 2012 bbc olympics

How the BBC works

• One domain[1]

• Two technology stacks[2]

• Cert’s and SSL• ProxyPass’• Apps are a TLD• 360+ apps• Everyone shares everything[3]

[1] Okay there are several but they are all really the same one.[2] Okay, three if you are going to be picky.[3] Yes really, everything!

Page 4: Velocity london 2012 bbc olympics
Page 5: Velocity london 2012 bbc olympics

Network Topology

• Dual DC[1]

• No DC affinity[2]

[1] One more soon(ish)[2] Well a couple of apps do[3]

[3] We don't talk about them

Page 6: Velocity london 2012 bbc olympics

Network Topology

Page 7: Velocity london 2012 bbc olympics

Traffic Routing

• TM -> PAL• TM -> Varnish -> TM -> PAL• TM -> Service Layer• TM -> Varnish -> TM -> Service Layer

Page 8: Velocity london 2012 bbc olympics

Traffic Routing

Request iplayer | \ everything else sport | \ v \ TM .-> TM .--> TM .--> TM | / | / | / | | / | / | / | v / v / api v / v Varnish PAL Varnish Dynamite

Page 9: Velocity london 2012 bbc olympics

Environments

• Integration• Test• Staging• Live• Journalism

Page 10: Velocity london 2012 bbc olympics

Right let’s do some testing

Page 11: Velocity london 2012 bbc olympics
Page 12: Velocity london 2012 bbc olympics

Why?

• Too much change– Network Architecture– Server Configurations– Load balancers– Peering points

• High Profile• Gain confidence

Page 13: Velocity london 2012 bbc olympics

Gaining Confidence

• Load testing on Stage– Tests individual applications– Single endpoints only– No concurrent load

• Real hardware• Real data– As much as possible

• Real Journalism

Page 14: Velocity london 2012 bbc olympics

Other objectives

• Maintain BAU• Handle failure gracefully• Deliver Expectation

Page 15: Velocity london 2012 bbc olympics

“What the Abdication did for

Radio

and the Coronation did for

Television,

London 2012 will do for

Online.”

Page 16: Velocity london 2012 bbc olympics

Current Volumetrics

• Big numbers for sport– 9M users/day– 90M views/day

• Punishing peaks– Saturday football final scores 4000 pv/s–~750k Concurrent users

• Wimbledon– 1700 pv/s

Page 17: Velocity london 2012 bbc olympics

Expected Volumetrics

• Expected peaks– 1.5M concurrent users– 60k different sports pages• 2,500 per minute

– 30% video via iPlayer

Page 18: Velocity london 2012 bbc olympics
Page 19: Velocity london 2012 bbc olympics

Timeline

• March 2012 (T minus 5 months)– Team members assigned– Resilience testing– Performance testing

• Testing with External Partner

Page 20: Velocity london 2012 bbc olympics

Olympics Run-up

• Jubilee (2nd June)• Euros (8th June)• Wimbledon (25th June)• Formula One

Page 21: Velocity london 2012 bbc olympics
Page 22: Velocity london 2012 bbc olympics

Cloud Testing

• International testing • Detailed test results

Page 23: Velocity london 2012 bbc olympics

Cloud Testing

• First performance test breaks live• Exposed monitoring issues• Couldn’t internally diagnose• Lots of tail, grep, awk, sed.

Page 24: Velocity london 2012 bbc olympics

Early Findings

• Stop tests• Monitoring• UK Data centre capacity• UK Data centre network segments

Page 25: Velocity london 2012 bbc olympics

(Not) Caching kills

• Conditional modules• Non-Olympics related modules– Commenting / Favourites

• Lowers cachability• Testing an immature product• Subsequent testing exposed more

Page 26: Velocity london 2012 bbc olympics
Page 27: Velocity london 2012 bbc olympics

What is a failure?

• Error 500?• Blank pages?• Stale content?• Slow pages?• Burning data centres?

Page 28: Velocity london 2012 bbc olympics

Resilience Testing

• Kill backends• Traffic Manager– Screw with headers – Screw with status (418 anyone)– Truncate body

• Introduce waits• Limit cache sizes• Reduce network bandwidth

Page 29: Velocity london 2012 bbc olympics
Page 30: Velocity london 2012 bbc olympics

Early findings

• Failure mode testing– Everything is a SPOF– Performance sucks in a failure

Page 31: Velocity london 2012 bbc olympics

Specific findings

• Monitoring Thresholds• Verbose logging, everywhere• Timeouts• No data• Volumetrics• Unfair load balancing

Page 32: Velocity london 2012 bbc olympics

Verbose Logging

• Wrong levels configured• Diagnostic information• Expected/Handled errors• Too much detail• Hurts health/forensic reporting

Page 33: Velocity london 2012 bbc olympics

Not enough logging

• Fatals with no logging • Unhandled conditions• Monitoring holes• Operations staff blind

Page 34: Velocity london 2012 bbc olympics

Platform Configuration

• Unfair load-balancing– Remove older commodity servers

• Competitive service applications– Re-home critical applications

Page 35: Velocity london 2012 bbc olympics

“Timeouts at lower levels in the

architecture MUST be set shorter

than the timeouts configured at

higher levels of the architecture.”

Page 36: Velocity london 2012 bbc olympics
Page 37: Velocity london 2012 bbc olympics

Timeouts

• Frontend/Backend timeouts– Frontends with lower timeouts– Caches never populated

• Alter backends to return early

Page 38: Velocity london 2012 bbc olympics

More timeouts

• Unspecified timeouts• Wrongly specified timeout units–ms/sec

Page 39: Velocity london 2012 bbc olympics
Page 40: Velocity london 2012 bbc olympics

Poor Application Performance

• Multiple synchronous content requests

• International cachability• Missing negative caching– Bypassed shared caches

Page 41: Velocity london 2012 bbc olympics

Testing frequency

• Every two weeks• Every week• Every other day

Page 42: Velocity london 2012 bbc olympics
Page 43: Velocity london 2012 bbc olympics

One week before…

The opening ceremony…

• 1st successful test on Live– with no errors at all.

Page 44: Velocity london 2012 bbc olympics
Page 45: Velocity london 2012 bbc olympics

Performance Overview

• Did find problems–Weren’t found on stage

• In all architecture layers• Components believed to be “fine”

were not• Stage is not suitable for this level of

testing• Proposal for any future “high profile”

event• CDNs didn’t really get tested

Page 46: Velocity london 2012 bbc olympics

Resilience Overview

• Teams never tested failure scenarios• Assumed that services didn’t fail• Inconsistent use of flagpoles• Reliance on mod_cache stale-on-

error

Page 47: Velocity london 2012 bbc olympics

Other problems

• Running a “fake” Olympics– That is invisible to the public– Did consider publicising a test

• No A/B (bucket) testing capability• Some tests affected BAU• No real test of the HLS HDS

streaming• Platform monitoring cycle

Page 48: Velocity london 2012 bbc olympics

Other problems

• RCA complicated by shared platform• Testing stopped by BAU/TX • High reliance on key staff– Some tests suffered

• No CDN testing– At their request– Places unfair load on infrastructure

• Unable to simulate network congestion

Page 49: Velocity london 2012 bbc olympics
Page 50: Velocity london 2012 bbc olympics

Working with external tester

• Workflow testing differed– User journeys – Direct linking to hotspots

• Very responsive to altering tests• Did add extra complexity

Page 51: Velocity london 2012 bbc olympics

Did it work

• YES– Found and fixed issues– Before they bit us– On production–With little impact on BAU

Page 52: Velocity london 2012 bbc olympics

Recommendations

• Increase stage capacity• Intelligent load balancing• Test NFRs in Development• Caching, caching and more caching• Kill load tests quickly• Improve internal load testing• Profile frontends under load• Better post analysis tools

Page 53: Velocity london 2012 bbc olympics

Some Statistics

Page 54: Velocity london 2012 bbc olympics

Daily Reach (M)

Page 55: Velocity london 2012 bbc olympics

Streaming Views (M) Wed 1st Aug

Page 56: Velocity london 2012 bbc olympics

Unique Browsers (M)

Page 57: Velocity london 2012 bbc olympics
Page 58: Velocity london 2012 bbc olympics

Thanks for listening

• Thanks to flickr users:– dgjones

• Office Dalek, London, 14-10-06• http://www.flickr.com/photos/dgjones/284592369

– b3cft• Bombe rebuild detail• http://www.flickr.com/photos/b3cft/3797123899

– Karindalziel• Clouds• http://www.flickr.com/photos/nirak/644336486

– Enjoy Surveillance• What are you looking at?• http://www.flickr.com/photos/enjoy-surveillance/34795807/

– Solo• 45th Annual Watsonville Fly-in and Air Show• http://www.flickr.com/photos/donsolo/4959045491/in/photostream/

– SF Brit• Sunset over Iguazu• http://www.flickr.com/photos/cnbattson/4333692253/

• Olympics Photos: www.london2012.com• Other Photos: EpicWin, FailBlog, Haha-Business

Page 59: Velocity london 2012 bbc olympics

Special Thanks

to:– David Holroyd• Technical Architect BBC Sport (Olympics)

–Matt Clark• Senior Technical Architect BBC Sport

Page 60: Velocity london 2012 bbc olympics

Thanks for listening

• This presentation:– TBC

• Me:– Andy “Bob” Brockhurst– Twitter: b3cft (and pretty much anywhere online)

– www.kingkludge.net

Page 61: Velocity london 2012 bbc olympics
Page 62: Velocity london 2012 bbc olympics