Scaling Slack
Keith Adams -- kma@slack-corp.com
GOTO Amsterdam 2018, June 20, 2018
How Slack works even though it can’t
● Introduction
● Some impossibility results
● Two case studies
● Takeaways
Like IRC?
Only visually.
● IRC is defined by its ephemerality
● Slack offers persistence
● Like a hybrid of e-mail and IRC
Slack Technical Constraints
Minimal Behavior of a Channel
● Validity/Agreement: if a member sends/receives a message, all members will eventually receive it.
● Integrity: a message is received by each member at most once, and only if it was previously sent.
● Total Order: all members receive messages in the same order
Atomic Broadcast Definition
● Validity/Agreement: if a member sends/receives a message, all members will eventually receive it.
● Integrity: a message is received by each member at most once, and only if it was previously sent.
● Total Order: all members receive messages in the same order
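As a rough illustration of these three properties, the sketch below checks per-member delivery logs against them. The function name `check_channel`, the log representation, and the message IDs are assumptions made for this example, not anything from Slack's code. It inspects a finished run only, so "eventually" is approximated by "by the end of the logs".

```python
from typing import Dict, List, Set


def check_channel(sent: Set[str], delivered: Dict[str, List[str]]) -> List[str]:
    """Return a list of violated properties for one channel's finished run.

    sent: IDs of every message that some member sent.
    delivered: for each member, the message IDs in the order they were received.
    """
    violations = []

    # Integrity: each message is received at most once, and only if it was sent.
    for member, log in delivered.items():
        if len(log) != len(set(log)):
            violations.append(f"integrity: {member} received a duplicate")
        if not set(log) <= sent:
            violations.append(f"integrity: {member} received an unsent message")

    # Validity/Agreement: every sent or received message reaches every member.
    everything = sent.union(*map(set, delivered.values()))
    for member, log in delivered.items():
        if set(log) != everything:
            violations.append(f"validity/agreement: {member} is missing messages")

    # Total order: any two members agree on the relative order of shared messages.
    logs = list(delivered.values())
    for a in logs:
        for b in logs:
            common = set(a) & set(b)
            if [m for m in a if m in common] != [m for m in b if m in common]:
                violations.append("total order: members disagree on order")
                return violations
    return violations
```

For example, `check_channel({"m1", "m2"}, {"a": ["m1", "m2"], "b": ["m2", "m1"]})` reports only a total-order violation: both members received everything exactly once, but in different orders.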
Uh-oh.
● Atomic broadcast is equivalent to consensus[1]
● Consensus in general is impossible[2]
[1] Chandra and Toueg. Unreliable failure detectors for reliable distributed systems. JACM 43(2):225-267, 1996.
[2] Fischer, Lynch, and Paterson. Impossibility of Distributed Consensus with One Faulty Process. JACM 32(2):374-382, 1985.
So … are we done here?
RIP Slack, 2014-2018
“Useful until proven impossible”
Of course not!
● There are practically useful consensus systems despite FLP
● Relax constraints
● Cryptocurrencies: probabilistic log
● Paxos/ZAB/Raft/...: might not terminate
Scaling Impossible Things
● What constraints to relax is an end-to-end property[1] of the system
● Varies by application, its parameters
● Complexity is inherent
● Our solution keeps changing with app, scale, user behavior, hardware economics, ...
[1] Saltzer, Reed, and Clark. End-to-End Arguments in System Design. 2nd International Conference on Distributed Computing Systems, Paris, April 1981, pp. 509-512.
Case Study #1: Message Send/Receive
Slack Cartoon
[Diagram: webapp, MySQL, Channel Server]
Division of Labor
Channel Server
Real-time service, accessed over WebSockets.
● Push updates to clients
● Messages, typing indicators, presence
● Witness to order of messages
● Grab-bag of other roles
WebApp
>1MLoC Hacklang monolith. Medium levels of SOA-osity.
● CRUD
● Storage
● Retrieval, permissions
● Session establishment
Send/receive for Online Clients
[Diagram sequence: Client A, Client B, Channel Server, WebApp]
1. Client Sends to CS
2. CS Amplifies, then Acks
3. End of User-Perceived Latency
4. Store Message in DB
The Happy Path
● Latency of WebApp, DB writes hidden from users
● But what if something goes wrong?
CS Crash!
[Diagram: Client A, Client B, Channel Server, WebApp]
WebApp Outage
[Diagram: Client A, Client B, Channel Server, WebApp]
Dealing with Failures
● CS maintains an on-disk buffer of uncommitted sends
● Replayed when recovering from CS crash
● Retried while webapp is unavailable
● State
○ Complexity
○ Risk during CS code changes
○ But provides partial end-to-end utility while site is hard-down
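A minimal sketch of the buffering idea above, assuming an append-only JSON-lines file. The `SendBuffer` class, its record format, and the `store` callback (standing in for the call to the webapp) are all hypothetical; the slides do not describe the real on-disk format.

```python
import json
import os
from typing import Callable, Dict, List


class SendBuffer:
    """Append-only log of sends that the webapp has not yet confirmed storing."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _append(self, record: Dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the record to disk before returning

    def record_send(self, msg_id: str, text: str) -> None:
        self._append({"op": "send", "id": msg_id, "text": text})

    def record_commit(self, msg_id: str) -> None:
        self._append({"op": "commit", "id": msg_id})

    def pending(self) -> List[Dict]:
        """Sends with no matching commit, in original order."""
        if not os.path.exists(self.path):
            return []
        sends, committed = [], set()
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "send":
                    sends.append(rec)
                else:
                    committed.add(rec["id"])
        return [s for s in sends if s["id"] not in committed]

    def replay(self, store: Callable[[str, str], bool]) -> int:
        """Retry pending sends in order; stop at the first failure."""
        done = 0
        for rec in self.pending():
            if not store(rec["id"], rec["text"]):
                break  # webapp still unavailable; try again later
            self.record_commit(rec["id"])
            done += 1
        return done
```

Calling `replay` after a restart covers the crash case, and calling it periodically covers the webapp outage. A crash between a successful `store` and the following `record_commit` would resend that message, so in this sketch `store` has to be idempotent to preserve the "at most once" integrity property. That is part of the complexity the State bullet refers to.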
Changes since 2014
● Webapp more stable
● Job queue more stable and scalable
○ Safe way of deferring work
○ See Saroj, Matt, Mike, and Tyler’s blog post
New Send Flow
[Diagram sequence: Client A, Client B, Channel Server, WebApp, JobQueue]
1. Defer Slow Work
2. Send Real-Time Updates
3. End of user-perceived latency
4. HTTP 200, deferred work
New Flow Observations
● Crash-safe
● Low latency by deferring costly parts
● Stateless-ish CS now possible
● Clients can send without establishing a WebSocket session
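The new flow can be sketched roughly as follows. Everything here is a stand-in: `JobQueue`, `ChannelServer`, and `handle_send` are hypothetical names, the queue is an in-memory deque rather than a durable service, and the slides do not say which work counts as "slow", so the deferred job just records that it ran.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class JobQueue:
    """Stand-in for a durable job queue; here just an in-memory deque."""

    def __init__(self) -> None:
        self.jobs: Deque[Callable[[], None]] = deque()

    def defer(self, job: Callable[[], None]) -> None:
        self.jobs.append(job)

    def run_all(self) -> None:
        while self.jobs:
            self.jobs.popleft()()


class ChannelServer:
    """Holds no send state in this flow; it only fans updates out to clients."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, List[str]] = {}

    def broadcast(self, text: str) -> None:
        for inbox in self.inboxes.values():
            inbox.append(text)


def handle_send(text: str, queue: JobQueue, cs: ChannelServer, slow_done: List[str]) -> int:
    """Hypothetical webapp handler for a message sent over plain HTTP."""
    queue.defer(lambda: slow_done.append(text))  # 1. defer the slow work
    cs.broadcast(text)                           # 2. send real-time updates
    return 200                                   # 3. reply; deferred work runs later
```

In this sketch the other clients see the message as soon as `broadcast` returns, before any queued job has run, which is where the low latency comes from. The Channel Server keeps nothing it would need to replay after a crash.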
So this way is better, right?
● In 2018, yes
○ Still rolling out to all geographies, teams
● But it definitely wasn’t in 2014
○ Extra hop between clients
○ Webapp was less available
○ JobQueue had finite capacity
Case Study #2: WebSocket initiation
Slack is Connection-Oriented
● Most of our community’s scaling experience is request-oriented
● Slack: Server-push via WebSockets
● > 5M simultaneous sessions at peak, with wide peak-to-trough variations
Classic Session Establishment Pattern
● Invoke rtm.start API method
● Use the wss:// URL in the results to start the session
[Diagram sequence: Client A, WebApp, Channel Servers]
1. WebApp harvests team data
2. WebApp delivers huge payload
3. Establishing WS connection (done)
rtm.start payload
● “Keyframe” of team state
● Users, profiles, channels and membership, latest-modified timestamps for channels, logged-in users’ last-read timestamps, ...
● Incremental updates via WebSocket
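A small sketch of the keyframe-plus-deltas pattern, assuming a dictionary-shaped snapshot. The field names and event types below are illustrative only and are not Slack's actual rtm.start payload or event schema.

```python
from typing import Any, Dict


def apply_event(state: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one incremental update to the client's copy of team state.

    The event shapes used here are illustrative, not Slack's real event schema.
    """
    kind = event["type"]
    if kind == "channel_added":
        state["channels"][event["id"]] = {"name": event["name"], "members": []}
    elif kind == "member_joined":
        state["channels"][event["channel"]]["members"].append(event["user"])
    elif kind == "user_renamed":
        state["users"][event["id"]]["name"] = event["name"]
    # Unknown event types are ignored so that old clients keep working.


# Keyframe: the full snapshot a client would get once, at session start.
keyframe = {
    "users": {"U1": {"name": "ana"}, "U2": {"name": "bo"}},
    "channels": {"C1": {"name": "general", "members": ["U1", "U2"]}},
}

# Deltas: small events that would then arrive over the WebSocket.
deltas = [
    {"type": "channel_added", "id": "C2", "name": "random"},
    {"type": "member_joined", "channel": "C2", "user": "U2"},
    {"type": "user_renamed", "id": "U1", "name": "ana-maria"},
]

state = keyframe
for event in deltas:
    apply_event(state, event)
```

The deltas stay small no matter how large the team is, but the keyframe grows with every user and channel.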
Great in 2014!
● This worked great for small teams
● …close to Slack’s datacenter
● As organizations surpassed 1,000, then 10,000, then 100,000 users
● ...and spread across the globe...
Problems
1. rtm.start payload size. (Performance)
2. Connection storms place redundant load on databases. (Reliability)
3. Round-trip times for most of the world. (Performance)
Slack’s Solution: Flannel
● Stateful, Application-aware Microservice
● Pre-warmed cache of teams, channels, users, ...
● Terminates WebSockets
● Runs in edge regions, reducing load on core and improving service time
● See Bing Wei’s blog post and talk for more details
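To illustrate why a pre-warmed cache helps with connection storms, here is a toy sketch. `EdgeCache`, `load_team`, and the shape of the session response are hypothetical; this is not Flannel's design or API, only the caching idea from the bullets above.

```python
from typing import Callable, Dict


class EdgeCache:
    """Toy edge cache: one copy of each team's state serves many sessions."""

    def __init__(self, load_team: Callable[[str], Dict]) -> None:
        self._load_team = load_team      # expensive call back to the core
        self._teams: Dict[str, Dict] = {}

    def prewarm(self, team_id: str) -> None:
        """Load a team's state before any client asks for it."""
        if team_id not in self._teams:
            self._teams[team_id] = self._load_team(team_id)

    def start_session(self, team_id: str, user_id: str) -> Dict:
        """Answer a connecting client from the cache, loading on a miss."""
        self.prewarm(team_id)
        team = self._teams[team_id]
        return {"self": user_id, "channels": team["channels"]}
```

If a thousand clients from the same team reconnect at once, `load_team` still runs at most once per team in this sketch, so the redundant database load from the Problems list does not reach the core. A real service would also have to keep the cached state up to date, which is omitted here.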
Flannel
[Diagram: Flannel instances in Paris, Singapore, and US East; webapp; Channel Server]
Establishing Session Flannel-Style
[Diagram: Client A, Flannel, WebApp]
So Flannel is better, right?
● Yes!
● Simpler, safer, faster
● But no way to foresee this before reaching this scale
● The next scale might change the answer
Takeaways
● Find the end-to-end part of your problem
● Optimality is contingent, and changes with growth
● Simplicity misapplied is just as poisonous as complexity