Milena Talavera Senior Infrastructure Manager@Slack
Flannel Slack’s Secret to Scale
Our Mission: To make people’s working lives simpler, more pleasant, and more productive.
Slack Scale
❖ 8M+ Daily Active Users
   3M+ paid users; 65% of Fortune 100 Companies
❖ 100+ countries
   50%+ of DAU outside of US
From supporting small teams 3-4 years ago
to serving gigantic organizations of hundreds of thousands of users today
Slack Scale
To support the rapid growth of yesterday and today, Slack’s infrastructure has to stay ahead of customer growth.
Biggest Teams
2015 8,000 users
2016 26,000 users
2018 266,000 users
Slack Architecture History Lesson
2015 2017 2018
Fat, greedy client
Fat, lazy client
Flannel Powered
Lazy + Flannel Powered
Resiliency
Scale: Pub/Sub
Fat, Greedy Client
[Architecture diagram. Labels: WebApp (PHP/Hack), reached over HTTP; Messaging Server (Java), reached over WebSocket; MySQL.]
User Connect Flow in 2015
Client ↔ Server (sequence over time, starting at Connect)
1. https://slack.com/api/rtm.start
2. HTTP response: a snapshot of the team
3. Long-lived WebSocket connection: real time events
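The three steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `FakeServer`, `GreedyClient`, and the event dictionaries are hypothetical stand-ins, and no HTTP or WebSocket traffic is involved.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for the WebApp plus Messaging Server."""
    users: dict
    channels: dict

    def rtm_start(self) -> dict:
        # Steps 1-2: the response carries a full snapshot of the team.
        return {"users": dict(self.users), "channels": dict(self.channels)}


@dataclass
class GreedyClient:
    users: dict = field(default_factory=dict)
    channels: dict = field(default_factory=dict)

    def connect(self, server: FakeServer) -> None:
        # The greedy client stores every object it is given.
        snapshot = server.rtm_start()
        self.users = snapshot["users"]
        self.channels = snapshot["channels"]

    def on_event(self, event: dict) -> None:
        # Step 3: real-time events keep the local copy current.
        if event["type"] == "user_change":
            self.users[event["user"]["id"]] = event["user"]
        elif event["type"] == "channel_created":
            self.channels[event["channel"]["id"]] = event["channel"]


server = FakeServer(
    users={"U1": {"id": "U1", "name": "ana"}},
    channels={"C1": {"id": "C1", "name": "general"}},
)
client = GreedyClient()
client.connect(server)
client.on_event({"type": "user_change", "user": {"id": "U1", "name": "ana-m"}})
client.on_event({"type": "channel_created", "channel": {"id": "C2", "name": "infra"}})
```

Because the client holds everything locally, reads are fast, but the snapshot it must download grows with the team.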
User Connect Flow in 2015
Advantages
○ Every Slack object available locally on the client
○ User experience was super speedy
○ Enabled us to move fast
User Connect Flow in 2015
Limitations
○ Expensive connection/reconnection
○ Large client memory footprint (grows with team size)
○ Susceptible to thundering herd
Team Snapshot Size
Number of users    Number of channels    Snapshot size (bytes)
30                 10                    200K
500                200                   2.5M
3,000              7,000                 20M
30,000             1,000                 60M
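To see why these snapshot sizes make a thundering herd expensive, here is a rough back-of-the-envelope calculation. It assumes K and M mean 10^3 and 10^6 bytes and that every reconnecting user downloads a full snapshot; `herd_bytes` and `reconnect_fraction` are names made up for this sketch.

```python
# Snapshot sizes from the table above, assuming K = 10**3 and M = 10**6 bytes.
SNAPSHOT_BYTES = {30: 200_000, 500: 2_500_000, 3_000: 20_000_000, 30_000: 60_000_000}


def herd_bytes(team_size: int, reconnect_fraction: float = 1.0) -> int:
    """Bytes the server must send if a fraction of a team reconnects at once."""
    reconnecting = int(team_size * reconnect_fraction)
    return reconnecting * SNAPSHOT_BYTES[team_size]


for size in SNAPSHOT_BYTES:
    gigabytes = herd_bytes(size) / 1e9
    print(f"{size:>6} users: {gigabytes:,.3f} GB for a full-team reconnect")
```

Under these assumptions, a full reconnect of the 30,000-user team means 30,000 × 60 MB = 1.8 TB of snapshot data.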
Max Team Sizes in 2015: ~8,000 users
User Connect Flow in 2016
Client ↔ Server (sequence over time, starting at Connect)
1. https://slack.com/api/rtm.start
2. HTTP response: a partial snapshot of the objects
3. Long-lived WebSocket connection: pruned real time events
4. Asynchronous fetch of non-essential objects
User Connect Flow in 2016
Incremental Improvements
○ Load less data at client boot time
○ Parallelized, lazy loading on demand
○ Simplified objects
On a 10,000 user team, these changes alone saved a few megabytes of data.
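A minimal sketch of the lazy-loading idea: boot with a partial snapshot, then fetch anything else the first time it is needed. `LazyStore`, `BACKEND`, and the `fetch` callback are hypothetical, and the parallel fetching mentioned on the slide is left out to keep the example short.

```python
from typing import Callable


class LazyStore:
    """Loads an object the first time it is asked for, then serves it from memory."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch      # hypothetical on-demand fetch, e.g. an API call
        self._cache: dict = {}
        self.fetch_count = 0

    def preload(self, objects: dict) -> None:
        # Boot time: only the partial snapshot is loaded.
        self._cache.update(objects)

    def get(self, object_id: str) -> dict:
        if object_id not in self._cache:
            self.fetch_count += 1
            self._cache[object_id] = self._fetch(object_id)
        return self._cache[object_id]


# Made-up backend data for a 10,000 user team.
BACKEND = {f"U{i}": {"id": f"U{i}", "name": f"user{i}"} for i in range(10_000)}

store = LazyStore(fetch=lambda object_id: BACKEND[object_id])
store.preload({"U0": BACKEND["U0"]})  # essential objects only
store.get("U0")    # served from the partial snapshot
store.get("U42")   # fetched on demand
store.get("U42")   # served from memory the second time
```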
User Connect Flow in 2016
Still Limitations
○ Still susceptible to thundering herd if clients dump their cache
○ Still grows with team size
Flannel Powered Slack
Flannel: Slack’s edge cache service
○ A query engine backed by cache on edge locations
Powered by Flannel
[Architecture diagram. Labels: Client; Edge PoPs (Cache); non-edge locations (WebApp PHP/Hack, Messaging Server Java, MySQL).]
1. WebSocket connection
2. HTTP POST: download a snapshot of the team
3. WebSocket: stream JSON events to keep the cache updated
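The flow above can be sketched as a per-team cache that is filled from a snapshot and then kept current by streamed events. This is a toy model, not Flannel's implementation: `TeamCache`, `load_snapshot`, and the event shape are hypothetical, and sharing one cached snapshot among all clients of a team is an assumption of the sketch.

```python
class TeamCache:
    """Hypothetical per-team cache held at an edge location."""

    def __init__(self, load_snapshot):
        self._load_snapshot = load_snapshot  # assumed call to the backend
        self._teams: dict = {}
        self.snapshot_loads = 0

    def connect(self, team_id: str) -> dict:
        # Assumption: only the first client of a team triggers a snapshot
        # download; later clients share the cached copy.
        if team_id not in self._teams:
            self.snapshot_loads += 1
            self._teams[team_id] = self._load_snapshot(team_id)
        return self._teams[team_id]

    def apply_event(self, team_id: str, event: dict) -> None:
        # Events streamed from the backend keep the cached team current.
        team = self._teams.get(team_id)
        if team is None:
            return  # nothing cached for this team, so nothing to update
        if event["type"] == "user_change":
            team["users"][event["user"]["id"]] = event["user"]


cache = TeamCache(lambda team_id: {"users": {"U1": {"id": "U1", "name": "ana"}}})
cache.connect("T1")
cache.connect("T1")  # second client of the same team reuses the cache
cache.apply_event("T1", {"type": "user_change", "user": {"id": "U1", "name": "ana-m"}})
cache.apply_event("T2", {"type": "user_change", "user": {"id": "U9", "name": "bo"}})
```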
Flannel Deployment Architecture
[Deployment diagram: the Client reaches an edge region through GeoDNS. Each edge region (A, B, C) has an HAProxy layer that routes, with team affinity, to a pool of Flannel hosts.]
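One common way to get team affinity is consistent hashing on the team ID, so that all clients of a team land on the same Flannel host. The sketch below assumes that technique; the slides only say "Team Affinity" and do not describe the routing algorithm. `HashRing`, the host names, and the virtual-node count are all illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring mapping team IDs to hosts (illustrative only)."""

    def __init__(self, hosts, vnodes: int = 64):
        # Each host gets several points on the ring to even out the spread.
        self._ring = sorted(
            (self._hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def host_for(self, team_id: str) -> str:
        # Pick the first ring point clockwise from the team's hash.
        idx = bisect.bisect(self._keys, self._hash(team_id)) % len(self._keys)
        return self._ring[idx][1]


HOSTS = ["flannel-a", "flannel-b", "flannel-c"]  # made-up host names
ring = HashRing(HOSTS)
print(ring.host_for("T12345"))
```

A useful property of this scheme is that removing one host only moves the teams that were on that host; every other team keeps its cache-warm host.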
Flannel Powered Slack 2017
Advantages
○ Clients have low latency access to key big objects through edge/PoP regions
○ Minimal client changes were needed to implement
○ More query flexibility and filtering than typical cache solutions like memcache
Features Powered by Flannel
○ Quick Switcher
○ Mention Suggestions
○ Channel Header
○ Channel Sidebar
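Features such as mention suggestions need a filtered query rather than a key lookup, which is the kind of flexibility a plain key-value cache does not offer. The sketch below shows one such query, a prefix match over cached users; `User`, `suggest`, and the sample data are hypothetical and not Flannel's query API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    name: str
    deleted: bool = False


def suggest(users, prefix: str, limit: int = 5):
    """Return up to `limit` active users whose name starts with `prefix`."""
    needle = prefix.lower()
    matches = [u for u in users if not u.deleted and u.name.lower().startswith(needle)]
    return sorted(matches, key=lambda u: u.name.lower())[:limit]


# Made-up cached team members.
TEAM = [
    User("U1", "maria"),
    User("U2", "mark", deleted=True),
    User("U3", "Max"),
    User("U4", "sam"),
]
print([u.name for u in suggest(TEAM, "ma")])  # ['maria', 'Max']
```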
Flannel Powered Slack 2017
Limitations
○ Keeping Flannel cache updated is expensive (firehose feed of events)
○ Thundering herd phenomenon is still a possibility
○ Cache on the WebSocket is in the critical path
Powered by Flannel V1.5
Impactful Improvements
○ Thrift Pub/Sub, reducing the number of events processed by 1000X
Powered by Flannel V1.5
[Architecture diagram. Labels: Client; Edge PoPs (Cache); non-edge locations (WebApp PHP/Hack, Messaging Server Java, MySQL).]
1. WebSocket connection
2. HTTP POST: download a snapshot of the team
3. Pub/Sub Thrift events to keep the cache updated
[Chart: events processed, before vs. after; caption truncated: "events reduce by"]
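The difference between a firehose feed and pub/sub can be illustrated with a toy broker: a cache that subscribes only to the teams it serves processes far fewer events than one that receives everything. `Broker`, `EdgeCache`, the one-topic-per-team layout, and all the numbers are assumptions of this sketch; the reduction it prints comes from those made-up numbers and does not reproduce the measured figure.

```python
from collections import defaultdict


class EdgeCache:
    """Counts the events it is asked to process."""

    def __init__(self):
        self.processed = 0

    def handle(self, event: dict) -> None:
        self.processed += 1


class Broker:
    """Toy topic-based broker with one topic per team (an assumption)."""

    def __init__(self):
        self._subscribers = defaultdict(set)

    def subscribe(self, topic: str, subscriber: EdgeCache) -> None:
        self._subscribers[topic].add(subscriber)

    def publish(self, topic: str, event: dict) -> None:
        # Only subscribers of this topic see the event.
        for subscriber in self._subscribers.get(topic, ()):
            subscriber.handle(event)


# 10,000 events spread evenly over 200 teams (made-up workload).
events = [(f"T{i % 200}", {"seq": i}) for i in range(10_000)]

firehose = EdgeCache()  # before: every event reaches the cache
for _team, event in events:
    firehose.handle(event)

broker = Broker()
scoped = EdgeCache()    # after: subscribed only to the teams it serves
broker.subscribe("T7", scoped)
broker.subscribe("T8", scoped)
for team, event in events:
    broker.publish(team, event)

print(firehose.processed, scoped.processed)
```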
Powered by Flannel V1.5
Impactful Improvements
○ Client lazily loads primary objects (users, channels, channel membership), significantly reducing boot time
Max Team Sizes in 2018: ~266,000 users
Resiliency
As scale increases, failures are more likely to happen. Our goal is to minimize blast radius and recovery time of failure modes.
Resiliency
Our observation: when failures happen, they happen faster than one can blink an eye. The solution cannot rely on human intervention.
Resiliency
Automated Admission Control
Measures Taken
○ Automated Admission Control based on various metrics. Examples: memory pressure, concurrent requests, etc.
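A minimal sketch of admission control driven by the two example metrics named above. The class name, thresholds, and the idea of passing memory usage in as a fraction are assumptions made for illustration; the slides do not describe the actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    max_concurrent: int = 100         # illustrative threshold
    max_memory_fraction: float = 0.9  # illustrative threshold
    in_flight: int = 0

    def try_admit(self, memory_fraction: float) -> bool:
        """Admit a request only if both metrics are below their limits."""
        if memory_fraction >= self.max_memory_fraction:
            return False  # shed load under memory pressure
        if self.in_flight >= self.max_concurrent:
            return False  # too many concurrent requests
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call when an admitted request finishes.
        self.in_flight = max(0, self.in_flight - 1)


controller = AdmissionController(max_concurrent=2)
results = [controller.try_admit(memory_fraction=0.5) for _ in range(3)]
controller.release()
after_release = controller.try_admit(memory_fraction=0.5)
under_pressure = controller.try_admit(memory_fraction=0.95)
```

Because the decision is made per request from live metrics, no human has to react when load spikes.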
Resiliency
Circuit Breakers
Measures Taken
○ Built-in Circuit Breakers to mitigate cascading failures and protect services from each other’s bad behaviours
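The circuit breaker pattern can be sketched in a few lines: after a run of failures the breaker opens and fails fast, then lets a trial call through once a cool-down has passed. The class, thresholds, and injectable clock below are illustrative assumptions, not the breakers Slack built.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            self._opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # a success closes the circuit
        return result
```

While the circuit is open the downstream service receives no calls at all, which gives it room to recover instead of being hit by retries.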
What Else
○ Regional Failover
○ Auto Scaling
Sneak Peek into the Future
Expand Pub/Sub to Client Side
○ Reduce events clients have to handle
○ Track what is in the current view
○ Subscribe/Unsubscribe to events when view changes
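The plan above can be sketched as a small tracker that diffs the visible channels against the current subscriptions whenever the view changes. `ViewSubscriptions` and its `log` of actions are hypothetical; in a real client the log entries would be subscribe and unsubscribe messages sent to the server.

```python
class ViewSubscriptions:
    """Tracks which channels are on screen and subscribes only to those."""

    def __init__(self):
        self.subscribed: set = set()
        self.log: list = []  # (action, channel_id) pairs standing in for server messages

    def set_view(self, visible_channels) -> None:
        visible = set(visible_channels)
        # Drop subscriptions for channels that left the view.
        for channel in sorted(self.subscribed - visible):
            self.log.append(("unsubscribe", channel))
        # Add subscriptions for channels that entered the view.
        for channel in sorted(visible - self.subscribed):
            self.log.append(("subscribe", channel))
        self.subscribed = visible

    def should_handle(self, event: dict) -> bool:
        # Events for channels outside the view are not the client's concern.
        return event["channel"] in self.subscribed


subs = ViewSubscriptions()
subs.set_view(["C1", "C2"])
subs.set_view(["C2", "C3"])
```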
THANK YOU! Got Questions?
Milena