Post on 13-Aug-2015
berlin aws meetup: here.com on aws
Implementation timeline, pitfalls and lessons learned
Cristian Măgherușan-Stanciu <cristian.magherusan-stanciu@here.com> @magheru_san
June 16, 2015
about here
HERE is a leading location company
∙ Over 6000 employees in 55 countries
∙ We own our map data
  ∙ state-of-the-art offline capabilities
  ∙ global map coverage
  ∙ weekly map updates
  ∙ location-based services around it
∙ Market leader in automotive
  ∙ in 4 out of 5 cars sold in the Western hemisphere
∙ Powering a myriad of partners, including
∙ Free apps on major mobile platforms
here in berlin
About us
∙ 820 internal employees and hiring :-)
∙ 56 countries and 25 languages, 36% Germans
∙ 20% female
∙ Average age 36 years
∙ Interesting mix of start-up/enterprise culture
∙ AWS-first policy for all our services
about here.com
The main consumer website of HERE
∙ Designed to seamlessly integrate with, and complement, the native mobile apps
∙ Reference implementation for many capabilities
∙ Re-written from scratch since Fall 2013
∙ Modern technology stack
about here.com
About the new version of here.com
∙ Re-launched only 6 months ago
∙ Monthly page loads in the tens of millions
∙ Traffic growing fast, already 3x since the re-launch
∙ Hosted on AWS
oct 2013 - first commits
Simple way to run the application
∙ Relatively easy to bootstrap
  ∙ reused as much as we could from our hello_world skeleton service
∙ Application running on EC2 instances
∙ Single AWS region
∙ Users would connect directly to the ELB
∙ All AWS infrastructure defined using CloudFormation
  ∙ stacks based on a reused hello_world template
∙ Primitive continuous delivery pipeline: Jenkins, Puppet, cron
dec 2013 - first internal release
’Backstage’ launch
∙ Shared with all HERE employees
∙ A few hundred daily users
∙ Started to get valuable feedback, mostly about the UX
∙ Production configuration snapshotted manually before the launch
∙ No major architecture changes
jan 2014 - infrastructure improvements
Deployment orchestration changes
∙ Fully controlled by Jenkins via ec2_collective/SQS
∙ Production deployments triggered automatically after every commit
  ∙ no longer relying on cron
  ∙ we can easily see deployment failures in the Job output
  ∙ automated configuration snapshotting for Production
jan 2014 - infrastructure improvements
Relatively large number of Dev environments
∙ Created and maintained manually via CloudFormation
∙ Configurations started drifting
∙ It became tedious to update them whenever a mass change was needed
∙ The clouds tool was written during a ’Research Week’
  ∙ makes it so much easier to manage diverging stacks
  ∙ released on GitHub as GPL2
  ∙ can be installed with gem install
aug 2014 - alpha release
Released to hundreds of selected preview users
∙ Capacity planning & load tests, all looked great
∙ Architecture remained almost the same
  ∙ added ElastiCache (memcached) as shared temporary storage
  ∙ worked around SQS limitations: split queues by environment
∙ Reports of slow loading performance triggered some actions
  ∙ started using NewRelic for Real User Monitoring (RUM)
  ∙ implemented WebPageTest (WPT) automation in our CI
oct 2014 - beta release
Opt-in release from the legacy website
∙ Beta invites implemented using SES
∙ Thousands of users world-wide
∙ More capacity planning
∙ Added CloudFront CDN for static files
oct 2014 - beta release
CloudFront setup details
∙ S3 bucket as origin
∙ Dev/prod S3 bucket sync, IAM cross-account bucket policy
∙ Noticed worse performance in NewRelic, WTH?
∙ CloudFront limitation: won’t compress content
  ∙ explicit gzip compression needed, scripted at build time
  ∙ upload already compressed files to S3
  ∙ only compress the files when it helps (>1KB size reduction)
∙ Required HTTP headers, set as S3 object metadata
  ∙ MIME type
  ∙ gzip encoding
  ∙ caching duration (we use half a year by default)
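
A minimal sketch of the build-time step described on this slide, not the actual build script. It assumes the 1KB threshold means 1024 bytes and that "half a year" is about 182 days; the function name `prepare_for_upload` and the exact header values are hypothetical. The function only decides what to upload and which headers to attach; the upload to S3 itself is left out.

```python
import gzip
import mimetypes

# Assumptions, not values from the talk: ">1KB" read as 1024 bytes,
# "half a year" read as 182 days.
MIN_SAVING_BYTES = 1024
CACHE_SECONDS = 182 * 24 * 3600


def prepare_for_upload(filename, body):
    """Return (payload, headers) for one static file.

    The payload is gzip-compressed only when that saves more than
    MIN_SAVING_BYTES; the headers are meant to be stored as S3 object
    metadata so that the CDN serves them unchanged.
    """
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    headers = {
        "Content-Type": content_type,
        "Cache-Control": "max-age=%d" % CACHE_SECONDS,
    }
    compressed = gzip.compress(body)
    if len(body) - len(compressed) > MIN_SAVING_BYTES:
        # Compression helps: ship the compressed bytes and say so.
        headers["Content-Encoding"] = "gzip"
        return compressed, headers
    # Too small to benefit: ship the file as-is, without Content-Encoding.
    return body, headers
```

Small files such as icons or tiny stylesheets fall through to the uncompressed branch, which matches the "only when it helps" rule above.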
oct 2014 - beta release
File path conventions
∙ File paths depend on the file content:
  /static_content/path/to/file.css_d34db33f
  ∙ ’d34db33f’ is the result of sha256(plain_file_content)[0..7]
  ∙ path translation table
  ∙ all files under one directory for easy filtering later
  ∙ intentionally decoupled from what’s deployed on EC2
  ∙ idempotent content updates
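
The naming scheme above can be sketched as follows. This is an illustration under assumptions: the hash is taken as the hex digest of SHA-256 over the uncompressed file, `[0..7]` is read as the first 8 characters (which matches the length of ’d34db33f’), and the function names and the shape of the translation table are hypothetical.

```python
import hashlib
from typing import Dict


def hashed_path(relative_path: str, content: bytes,
                prefix: str = "/static_content") -> str:
    """Build a content-addressed path like /static_content/a/b.css_<hash>."""
    # First 8 hex characters of the SHA-256 of the plain file content.
    digest = hashlib.sha256(content).hexdigest()[:8]
    return "%s/%s_%s" % (prefix, relative_path, digest)


def build_translation_table(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each original path to its content-addressed path."""
    return {path: hashed_path(path, body) for path, body in files.items()}
```

Because the path changes only when the content changes, re-uploading an unchanged file is a no-op and a changed file never collides with a cached older copy, which is what makes the long cache lifetime safe.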
oct 2014 - beta release
Still single region
∙ Limitation of our custom continuous deployment automation was fixed, but it was too late
∙ Initial test results
  ∙ CloudFront static file caching would hide this well enough
  ∙ NewRelic and WebPageTest results deemed acceptable
dec 2014 - launch
All traffic from the legacy environment (HTTP redirect)
∙ Millions of users world-wide, more capacity planning needed
∙ Extended CloudFront, now also used for dynamic content
∙ Decided to implement dynamic-CloudFront before multi-region, more benefits for little extra cost
  ∙ OCSP Stapling - no more extra blocking call to your CA: 80-400ms saving
  ∙ early TCP termination: 50-500ms saving
  ∙ long-living connections between CloudFront and ELB
  ∙ HTTP redirects to HTTPS: 50-500ms saving for plain HTTP users
  ∙ Browsers: one less domain to resolve, fewer TCP connections to maintain, less CPU usage
jan 2015 - multi-region
First expansion attempt
∙ Latency-based routing with Route53, really straightforward
∙ No other architecture changes were needed
∙ Deployed to Singapore and Frankfurt in addition to existing Virginia
∙ Soon realized that Frankfurt was a bit ’special’ (broken) :-)
  ∙ different way to define ElastiCache SGs (VPC-only region)
  ∙ ElastiCache was not yet supported by CloudFormation there
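
For readers unfamiliar with latency-based routing, the sketch below builds the kind of change batch that the Route53 `ChangeResourceRecordSets` API accepts: one alias record per region, all sharing the same name, each tagged with a `Region` and a unique `SetIdentifier`. It is only an illustration; the talk states that the infrastructure was defined in CloudFormation, and the helper name, record name, ELB DNS names and zone IDs used here are hypothetical placeholders.

```python
def latency_change_batch(record_name, endpoints):
    """Build a Route53 change batch with one latency record per region.

    endpoints maps an AWS region code to a tuple of
    (ELB DNS name, ELB hosted zone ID); both values are placeholders
    that the caller has to supply.
    """
    changes = []
    for region, (elb_dns, elb_zone_id) in sorted(endpoints.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # Records sharing a name must have distinct identifiers.
                "SetIdentifier": region,
                # The presence of Region makes this a latency record.
                "Region": region,
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,
                    "DNSName": elb_dns,
                    "EvaluateTargetHealth": True,
                },
            },
        })
    return {"Changes": changes}
```

With Virginia, Singapore and Frankfurt the `endpoints` keys would be `us-east-1`, `ap-southeast-1` and `eu-central-1`; Route53 then answers each DNS query with the record whose region has the lowest measured latency to the resolver.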
jan 2015 - multi-region
With Singapore added, we noticed almost no performance improvement - WTH?
∙ Investigation immediately revealed NewRelic setup errors
  ∙ incorrectly included in HTML
  ∙ we were missing metrics from the slowest clients! :-(
∙ Fixed the NewRelic configuration
  ∙ noticed how slow we really were in most geographies
jan 2015 - multi-region
Investigating the lack of performance improvements
∙ Backend performance issues in Singapore
∙ Only shifting network latency, not overcoming it
∙ Root cause: some APIs we depend on when rendering HTML were deployed in remote regions
jan 2015 - multi-region
Speeding up Singapore
∙ Avoid blocking API calls from the landing page
  ∙ replaced one with a local GeoIP database, removed another
  ∙ backend performance improved 50x
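
To illustrate why a local database removes the blocking call, here is a rough sketch of an in-process lookup that answers from memory instead of calling a remote API. It is not the GeoIP product or code used by here.com: the class name is hypothetical, the table is assumed to hold non-overlapping IPv4 ranges, and the sample ranges in the usage note are documentation-only networks with made-up country codes.

```python
import bisect
import ipaddress


class LocalGeoIP:
    """In-memory IPv4 range table; lookups never leave the process."""

    def __init__(self, ranges):
        # ranges: iterable of (cidr, country_code), assumed non-overlapping.
        rows = []
        for cidr, country in ranges:
            net = ipaddress.ip_network(cidr)
            rows.append((int(net.network_address),
                         int(net.broadcast_address),
                         country))
        rows.sort()
        self._rows = rows
        self._starts = [row[0] for row in rows]

    def country(self, ip, default=None):
        value = int(ipaddress.ip_address(ip))
        # Find the last range that starts at or before this address.
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and self._rows[i][0] <= value <= self._rows[i][1]:
            return self._rows[i][2]
        return default
```

Usage, with made-up data: `LocalGeoIP([("192.0.2.0/24", "DE"), ("198.51.100.0/24", "SG")]).country("198.51.100.7")` returns `"SG"` after a binary search in local memory, with no cross-region round trip on the rendering path.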
apr-may 2015 - performance issues
Loading performance was lagging behind our competitors
∙ They improved significantly
∙ We got many new users from emerging markets
∙ Visible in user feedback and bounce rates
∙ Had to take some actions
apr-may 2015 - magellan
Our current ways of working, Magellan, set up in Jan 2015
∙ Self-organizing, temporary, cross-functional teams mandated by management to increase a metric
∙ Bottom-up innovation
  ∙ everyone chooses their team
  ∙ design, implementation and release are the team’s responsibility
  ∙ management reviews the progress and provides some advice
∙ First iteration (Jan - Apr): post-launch usability improvements
∙ Second iteration: tech debt and performance fixes
apr-may 2015 - magellan
Improving our performance
∙ Goal of one of the teams
  ∙ bring load performance back on par with the competition
∙ Actions that were taken
  ∙ finally launched Frankfurt (fixed in the meantime)
  ∙ also Sydney and California
  ∙ refactored our CloudFormation stacks (now all identical)
  ∙ instances were right-sized
  ∙ devs heavily optimized the application for faster loading
∙ DevOps at its best
apr-may 2015 - magellan
Results
∙ Visual progress now comparable to Google Maps, our competition :-)
∙ Global loading time average reduced by about a second
∙ Lots of improvement ideas were added to the backlog
∙ More fixes to be implemented soon
next steps
More performance improvements
∙ Fix some remaining bugs
  ∙ we’d finish loading 2-3 seconds earlier
  ∙ but minimal visual progress changes
∙ SPDY / HTTP/2 on CloudFront
  ∙ AWS has to implement it
  ∙ eventual application changes
∙ Reverse-proxy some of our client APIs through CloudFront
conclusions
In no particular order
∙ Start small
∙ Iterate continuously
∙ Be data-driven in decision making (A/B, user feedback, RUM, WPT)
∙ Not all AWS regions are (born) equal
∙ Expect and embrace AWS limitations
∙ Workarounds sometimes lead to bigger improvements (cache busting, clouds)
∙ CloudFront is excellent at HTTPS website acceleration, use it!
∙ Automate anything that bothers you
∙ DevOps FTW!
references and credits
Resources
∙ Clouds on GitHub: https://github.com/cristim/clouds
∙ All logos and images used are © their respective authors