Dark launching with Consul at Hootsuite - Bill Monkman
Dark Launching with Consul
Bill Monkman, Senior Specialist Engineer (@bmonkman)
Hootsuite
• The most widely used platform for managing social media
• Integrates with Twitter, Facebook, Instagram, LinkedIn, G+, etc.
• Started 7 years ago, now over 10 million users
• Used by over 800 of the Fortune 1000
Hootsuite
• Everything in Amazon AWS (low thousands of servers)
• Primary languages: PHP, Scala, Python, Go
• 10+ releases to production per day
• 20+ microservices
• Started using Consul in late 2014
• Also using Vagrant, Packer, Terraform, Vault
Consul at Hootsuite
• Deployed in all datacenters (AWS Regions) in staging and prod
• Clusters of 3-5 servers (Multi-AZ)
• Consul agent installed on almost every server
• First use: Dark Launching
Dark Launching
• AKA Feature Flagging, Feature Toggle, etc.
• Allow dynamic control of your systems in real time
• Used extensively at Facebook, Etsy, Flickr, others
• Integrated with all the languages we use, both front and back-end
• Very powerful tool for continuous delivery
• Key to engineers at HS pushing code quickly and confidently
• Allowed even other departments to control the system (Support, Marketing)
Dark Launching
Various restriction types:
• boolean
• percentage_static
• percentage_random
• user_list
• organization_list
• plan_code
• language
• webserver
• etc.
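To make the restriction types concrete, here is a minimal sketch of how a few of them could be evaluated. It is not Hootsuite's implementation. The flag shape ('value' and 'restriction') follows the generated config shown later in the deck; the function names are hypothetical, and the semantics are assumptions: percentage_static is read as a stable per-user bucket, percentage_random as a fresh roll on each check.

```python
import hashlib
import random


def stable_bucket(flag_name, user_id):
    """Map a (flag, user) pair to a bucket in 0..99 that never changes."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(flag_name, flag, user_id=None, rng=random.random):
    """Evaluate one flag; `flag` is a dict with 'restriction' and 'value'."""
    restriction = flag["restriction"]
    value = flag["value"]
    if restriction == "boolean":
        return bool(value)
    if restriction == "percentage_static":
        # Assumed meaning: the same user always gets the same answer.
        return user_id is not None and stable_bucket(flag_name, user_id) < value
    if restriction == "percentage_random":
        # Assumed meaning: each check is an independent roll.
        return rng() * 100 < value
    if restriction == "user_list":
        return user_id in value
    # Restriction types this sketch does not model fail closed.
    return False
```

Hashing the flag name together with the user id keeps a user's bucket independent across flags, so a 10% rollout of one feature does not always land on the same 10% of users as another.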
Use Cases
Typical
Push new code then:
● Dark launch to yourself or your team to test
● Launch to the whole Hootsuite organization
● 10% of all users
● Watch graphs
● 50%
● 100%
● Simple means of rollback if necessary
Use Cases
Migration
● Controlled migration to new services
● Phased rollouts
● Allowing beta group of users to try new features ahead of full release
Use Cases
Load Testing
● When creating a new feature or service, send partial traffic to it, slowly ramp up
● Shadow reads/writes
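A shadow read can be sketched as follows, assuming the dark launch check has already been reduced to a boolean. The function and parameter names are hypothetical: the caller always gets the primary result, while the new service is exercised on the side and any disagreement or failure is only reported.

```python
def shadow_read(primary, shadow, key, on_mismatch, shadow_enabled):
    """Read from `primary`; optionally mirror the read to `shadow`."""
    result = primary(key)
    if shadow_enabled:
        try:
            if shadow(key) != result:
                on_mismatch(key)
        except Exception:
            # A failing shadow service must never affect the real request.
            on_mismatch(key)
    return result
```

Driving shadow_enabled from a percentage-style flag gives the slow ramp-up described above without exposing users to the new code path.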
Use Cases
Security / Protection
● “Kill twitter streams” flag
● Attack mitigation
Use Cases
A/B Testing
● Test a feature to half the user base to gauge impact/adoption
● Try to limit it to simple tests. Anything more complex needs a real A/B framework
Dark Launching at Hootsuite
Wrap code in a dark launch block
Newly added flags will be automatically registered in the KV store the first time the code executes (with some stampede protection)
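The slide's code did not survive extraction, so the following is only a sketch of the idea, with hypothetical names throughout. An unknown flag is treated as off and handed to a `register` callback (standing in for the write to the KV store) at most once per process, which is one simple form of stampede protection. Only the boolean restriction is modelled.

```python
import threading


class DarkLaunch:
    """Flag lookup that registers unknown flags exactly once per process."""

    def __init__(self, flags, register):
        self._flags = flags          # flag name -> {'value': ..., 'restriction': ...}
        self._register = register    # hypothetical callback that writes to the KV store
        self._pending = set()
        self._lock = threading.Lock()

    def is_enabled(self, name):
        flag = self._flags.get(name)
        if flag is None:
            self._register_once(name)
            return False             # a flag nobody has configured yet stays dark
        return flag["restriction"] == "boolean" and bool(flag["value"])

    def _register_once(self, name):
        with self._lock:
            if name in self._pending:
                return               # another caller already registered it
            self._pending.add(name)
        self._register(name)
```

Application code would then wrap the new path in `if dark_launch.is_enabled("SOME_FLAG"):` and fall through to the old path otherwise. Protection across many processes or servers would need something stronger than this in-memory set, such as a check-and-set write.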
Dark Launching at Hootsuite
Managed via a web interface
(screenshot)
Dark Launching at Hootsuite
• Has become core to our continuous delivery workflow
• Changed the way we use source control
• Branching in production
• Comes with some associated costs - cleanup / complexity
Dark Launching at Hootsuite
Initial implementation
(diagram: Web Servers running PHP-FPM, Memcached, MySQL)
Problems with the old way
● As Dark Launching became important to our process, usage skyrocketed
● Initial implementation with MySQL and Memcached ran into various issues
○ Hot cache keys
○ Too tied in to our core dashboard
○ Not suitable for a distributed system (move to microservices)
● Outages!
Enter Consul
● Fans of HashiCorp products already
● Saw potential for a “push” based solution to dark launch management
● Wanted to explore it for other uses; this was a useful test ground
● Evaluated a few tools, and though Consul was fairly bleeding-edge, we liked its feature set and direction and had faith in the team behind it
● Based on well-known algorithms/protocols (Raft and SWIM)
● Started experimenting with a small-scale deployment
Implementation
Base data stored in Consul KV store (with metadata in MongoDB)
Implementation
Watch added using Ansible, baked into image
Implementation (PHP)
● Handler that receives all KV data for a project
● Writes out a PHP syntax config file with all data as an array
● Hits webserver on localhost to clear APC cache (in-memory cache)
● PHP code then checks the cache, reloads from the file if missing, and does a key lookup on the array of dark launch data
● If the checked flag does not exist in the data, it communicates with the local Consul agent to add it
Implementation (PHP)
<?php
$dlCodes = array (
    'ACCOUNT_CURRENCY_TOGGLE' => array (
        'value' => 0,
        'restriction' => 'boolean',
        'isAvailableToJs' => 0,
        'createdDate' => '2015-09-28 00:12:34',
    ),
    ...
);
Modifying a flag
(diagram, steps 1-5: Web Servers each running PHP-FPM and a Consul Agent with the DL Config file; three Consul Servers)
Creating a flag
(diagram, steps 1-4: Web Servers each running PHP-FPM and a Consul Agent with the DL Config file; three Consul Servers)
Implementation (Scala)
● Handler that receives all KV data for a project
● Writes out a Typesafe HOCON syntax config file with all data as a list
● Uses inotify to watch for changes to the file
● Scala code asks the actor for data for a specific dark launch code
● Uses an Akka Agent (a construct which just manages state)
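The reader side of this design, in-memory state refreshed when the file on disk changes, can be sketched in a few lines. This is a stand-in, not the Scala code: it swaps inotify for a modification-time check on each lookup, HOCON for JSON, and the Akka Agent for a plain object. The class name is hypothetical.

```python
import json
import os


class FlagFile:
    """Holds the latest flag data and reloads it when the file changes."""

    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._flags = {}

    def get(self, name):
        self._reload_if_changed()
        return self._flags.get(name)

    def _reload_if_changed(self):
        try:
            mtime = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            return                      # keep serving the last known data
        if mtime != self._mtime:
            with open(self._path) as handle:
                self._flags = json.load(handle)
            self._mtime = mtime
```

Because lookups are answered from memory and the file is only a refresh source, reads keep working when the file is temporarily missing.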
Implementation (Containers)
● We use Mesos / Marathon to schedule long-running services written in Scala and Go
● Similar to previous implementations
● Consul runs on the Mesos slave host and writes all service dark launch data to disk
● Shared between all containers on the host
Problems
● Multi-DC setup was hampered until Consul 0.5.1 due to lack of distinct LAN/WAN advertise addresses
● Atomicity - Convergence is slower than atomic memcached change, though it’s not a problem for our usage of dark launching (typical convergence is within 1 second)
Convergence
1 second
Lessons Learned
● Enable ACLs early, plan your usage of ACLs
● Put enough thought into your KV store structure
● You may need to bribe your security team to convince them that having bi-directional communication between all nodes on specific ports is okay
● It’s important to understand Consul’s outage recovery process and document what to do in the unlikely event that all servers fail
● Key prefix type events will be delivered even to nodes that were down at the time of the event
Conclusions
● Consul worked well for us right from the start (~0.4.0)
● Making an existing, valuable system better was a great way to introduce it to the company, making its adoption much smoother
● Using it for many other projects now
○ Nginx LB configuration based on auto-scaling web servers
○ Service discovery for seeding Akka Cluster
○ Distributed locking for various purposes
○ Microservice discovery and routing system (Skyline)
● Seamless upgrade process
Conclusions
● Increased stability and decreased load on Memcached / MySQL
● Since data is now pushed rather than pulled, the system can still read dark launch data independently of the state of the data store
● Now usable in all DCs, projects and environments
● Shared state allows us to coordinate changes between microservices