Michael Demmer - November 16-20, 2020 · Michael Demmer November 6, 2018 Scaling Slack The Good,...

Michael Demmer November 6, 2018 Scaling Slack The Good, The Unexpected, and The Road Ahead mdemmer@slack-corp.com | @mjdemmer


Page 1:

Michael Demmer

November 6, 2018

Scaling Slack

The Good, The Unexpected, and The Road Ahead

mdemmer@slack-corp.com | @mjdemmer

Page 2:

Me

Page 3:

(Not) This Talk

1. 2016: Monolith

2. 2016-2018: Microservices

3. 2016-2018: Best Practices

4. 2018: Lessons Learned

Page 4:

This Talk

1. 2016: How Slack Worked

2. 2016-2018: Things Got More Interesting

3. 2016-2018: What We Did About It

4. 2018+: Themes and Road Ahead

Page 5:

Slack in 2016

Page 6:

Slack

Page 7:

Workspaces, Channels, Users, and more

Duff Beer

Oceanic Airlines

Delos

A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.

us_east_1

AcmeCorp

#brainstorming #proj-roadrunner #marketing …

@alice @bob @carol ...

Page 8:

Slack Facts (2016)

User Base: 4M Daily Active Users

Largest Organizations: >10,000 Active Users

Connectivity: 2.5M peak simultaneous connected, Avg 10 hrs/day

Engineering Style: Conservative, Pragmatic, Minimal. Most systems > 10 year old technology

Page 9:

How Slack Works (2016)

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Message Proxy; Message Server (Java); Webapp (PHP); Job Queue; MySQL; regions us_east_1 and us_west_1.]

Page 10:

Client / Server Flow

Initial login:

● Download full workspace model with all channels, users, emoji, etc.

● Establish real time websocket

[Diagram. Labels: Client; Webapp (PHP); Message Proxy.]

1: rtm.start

2: prefs: {...}, users: {...}, channels: {...}, emoji: {...}, ms: “ms1.slack-msgs.com”

3: websocket connect

Page 11:

Client / Server Flow

Initial login:

● Download full workspace model with all channels, users, emoji, etc.

● Establish real time websocket

While connected:

● Push updates via websocket

● API calls for channel history, message edits, create channels, etc.

[Diagram. Labels: Webapp (PHP); Message Proxy.]

reactions.add

{message: ...}

Page 12:

Sharding And Routing

Workspace Sharding

● Assign a workspace to a DB and MS shard at creation

● Metadata table lookup for each API request to route

[Diagram. Labels: Webapp (PHP); Mains; Message Servers; MySQL Shards.]

select * from teams where id=1234

{id:1234, domain:demmer, db_shard:35, ms_shard:11, ...}
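The routing step on this slide can be sketched in a few lines. This is an illustrative sketch, not Slack's implementation: the `TEAMS` dictionary stands in for the `teams` metadata table, and `TeamRoute`, `route_request`, and the `db-shard-N` / `ms-shard-N` host naming are hypothetical. Only the field names and values (`id`, `domain`, `db_shard`, `ms_shard`) come from the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TeamRoute:
    """One row of the (hypothetical) teams metadata table."""
    team_id: int
    domain: str
    db_shard: int
    ms_shard: int


# Stand-in for the metadata table on the mains; the values mirror the slide.
TEAMS = {
    1234: TeamRoute(team_id=1234, domain="demmer", db_shard=35, ms_shard=11),
}


def route_request(team_id: int) -> Tuple[str, str]:
    """Look up the workspace once, then return its DB and message server shards."""
    row = TEAMS.get(team_id)
    if row is None:
        raise KeyError(f"unknown workspace {team_id}")
    # The shard naming scheme below is an assumption for illustration only.
    return f"db-shard-{row.db_shard}", f"ms-shard-{row.ms_shard}"
```

Because the assignment is fixed at workspace creation, every request for workspace 1234 lands on the same pair of shards, which is what makes the model easy to reason about and also what creates hot spots for very large workspaces.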

Page 13:

Sharding And Routing

Workspace Sharding

● Assign a workspace to a DB and MS shard at creation

● Metadata table lookup for each API request to route

“Herd of Pets”

● DBs run in active/active pairs with application failover

● Service hosts are addressed in config and manually replaced

[Diagram. Labels: Webapp (PHP); Mains; Message Servers; MySQL Shards.]

Page 14:

Why This Worked

Server Experience

Implementation model is straightforward, easy to reason about and debug.

● All operations are workspace scoped

● Horizontally scale by adding servers

● Few components or dependencies

Client Experience

Data model lends itself to a seamless, rich real-time client experience.

● Full data model available in memory

● Updates appear instantly

● Everything feels real time

Page 15:

Things Get More Interesting...

Page 16:

Things Get More Interesting

Product Model

Size and Scale

Page 17:

Slack Growth

Page 18:

Slack Facts (2018)

User Base: >8M Daily Active Users

Largest Organizations: >125,000 Active Users

Connectivity: >7M peak simultaneous connected, Avg 10 hrs/day

Engineering Style: Still pragmatic, but embrace complexity where needed to solve hardest problems

Page 19:

Slack Facts (2018)

User Base: >8M Daily Active Users (2x)

Largest Organizations: >125,000 Active Users (10x!)

Connectivity: >7M peak simultaneous connected, Avg 10 hrs/day (3x)

Engineering Style: Still pragmatic, but embrace complexity where needed to solve hardest problems

Page 20:

Change the Model

Duff Beer

Oceanic Airlines

Delos

A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.

us_east_1

AcmeCorp

#brainstorming #proj-roadrunner #marketing …

@alice @bob @carol ...

Page 21:

Change the Model

AcmeCorp

Duff Beer

Oceanic Airlines

Delos

Wayne Enterprises

Wayne Shipping

Wayne Finance

Wayne Security

Enterprise

Workspaces

Page 22:

Change the Model

AcmeCorp

Duff Beer

Oceanic Airlines

Agents of SHIELD

Stark Industries

Delos

Wayne Enterprises

Wayne Shipping

Wayne Finance

Wayne Security

Shared Channels

Workspaces

Enterprise

Page 23:

Challenges

Recurring Issues

● Large organizations: Boot metadata download is slow and expensive

● Thundering Herd: Load to connect >> Load in steady state

● Hot spots: Overwhelm database hosts (mains and shards) and other systems

● Herd of Pets: Manual operation to replace specific servers

● Cross Workspace Channels: Need to change assumptions about partitioning

Page 24:

So What Did We Do?

Page 25:

What Did We Do

Message Services

Service Decomposition

Vitess

Fine-Grained DB Sharding

Thin Client Model

Flannel Cache

Page 26:

What Did We Do

Thin Client Model

Flannel Cache

Page 27:

Challenge: Boot Model Explosion

boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

Users      Profiles   Channels   Total
12         6 KB       1 KB       7 KB
530        140 KB     28 KB      168 KB
4,008      5 MB       2 MB       7 MB

Page 28:

Challenge: Boot Model Explosion

boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

Users      Profiles   Channels   Total
12         6 KB       1 KB       7 KB
530        140 KB     28 KB      168 KB
4,008      5 MB       2 MB       7 MB
44,030     36 MB      25 MB      59 MB
148,170    78 MB      40 MB      118 MB
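The payload formula above translates directly into a small function. The sketch below is only a worked instance of that formula: the per-object byte sizes and the single average channel membership are made-up inputs, not Slack's measurements, and it does not attempt to reproduce the table.

```python
def boot_payload_bytes(num_users: int,
                       user_profile_bytes: int,
                       num_channels: int,
                       channel_info_bytes: int,
                       users_per_channel: int,
                       user_id_bytes: int) -> int:
    """Estimate the boot payload size using the formula from the slide.

    users_per_channel is a single average, which is a simplifying assumption;
    real channels vary widely in membership.
    """
    profiles = num_users * user_profile_bytes
    channels = num_channels * (channel_info_bytes + users_per_channel * user_id_bytes)
    return profiles + channels


# Hypothetical inputs: 1,000 users with 500-byte profiles, 200 channels with
# 300 bytes of channel info and 50 members each, 10 bytes per member ID.
example_total = boot_payload_bytes(1000, 500, 200, 300, 50, 10)
```

The membership term multiplies the channel count by the member count, so the payload grows faster than linearly when both users and channels grow with the organization.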

Page 29:

Thin Client Model

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Message Proxy; Message Server; Webapp; Job Queue; MySQL; regions us_east_1 and us_west_1.]

Page 30:

Thin Client Model

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Message Proxy; Message Server; Webapp; Job Queue; MySQL; Consul; regions us_east_1 and us_west_1.]

Page 31:

Thin Client Model

Flannel Service: Globally distributed edge cache

Minimize Workspace Model: Much smaller boot payload

Routing: Workspace affinity for cache locality

Query API: Fetch unknown objects from cache

Cache Updates: Proxy subscription messages to clients

[Diagram. Labels: Flannel; Websocket.]

Page 32:

Thin Client Model

Unblock Large Organizations

Adapting clients to a lazy load model was a critical change to enable Slack for large organizations.

● Huge reduction in payload times on initial connect

● Flannel efficiently responds to more than 1 million queries per second

● Adds challenges of cache coherency and reconciling business logic
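The lazy-load idea can be illustrated with a small cache that fetches an object only the first time it is asked for. This is a sketch under stated assumptions, not Flannel's design: `LazyObjectCache`, its `fetch` callback, and `apply_update` are hypothetical names, and a real edge cache would also need eviction and the coherency handling mentioned above.

```python
from typing import Callable, Dict


class LazyObjectCache:
    """Fetch objects (users, channels, ...) on first use instead of at boot."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch                  # hypothetical backend lookup
        self._objects: Dict[str, dict] = {}
        self.misses = 0                      # number of backend fetches so far

    def get(self, object_id: str) -> dict:
        """Return the cached object, fetching it once on a miss."""
        if object_id not in self._objects:
            self.misses += 1
            self._objects[object_id] = self._fetch(object_id)
        return self._objects[object_id]

    def apply_update(self, object_id: str, fields: dict) -> None:
        """Apply a pushed update, but only to objects already cached."""
        if object_id in self._objects:
            self._objects[object_id].update(fields)
```

Updates for objects nobody has asked for are dropped rather than stored, which keeps the cached model small; the next `get` for such an object simply fetches the current version.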

Page 33:

What Did We Do

Vitess

Fine-Grained DB Sharding

Page 34:

Challenge: Hot Spots & Manual Repair

Page 35:

Vitess

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Message Proxy; Message Server; Webapp; Job Queue; MySQL; Consul; regions us_east_1 and us_west_1.]

Page 36:

Vitess

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Message Proxy; Message Server; Webapp; Job Queue; MySQL; VtGate; VtTablet; Consul; regions us_east_1 and us_west_1.]

Page 37:

Vitess

Flexible Sharding: Vitess manages per-table sharding policy

Topology Management: Database servers self-register

Single Master: Using GTID and semi-sync replication

Failover: Orchestrator promotes a replica on failover

Resharding Workflows: Automatically expand the cluster

[Diagram. Labels: Webapp; VtGate; VtTablet; MySQL.]

Page 38:

Vitess

Fine-Grained Sharding

Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds.

● Retains MySQL at the core for developer / operations continuity

● More mature topology management and cluster expansion systems

● Data migrations that change the sharding model take a long time
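To see why a finer-grained shard key spreads load, here is a generic hash-based sketch. It is not how Vitess assigns rows to shards; `shard_for`, the key strings, and the shard count of 16 are all assumptions chosen for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 16
channels = [f"C{i}" for i in range(1000)]   # hypothetical channels of one large team

# Workspace-sharded: every channel of the team shares the team's single shard.
workspace_shards = {shard_for("T1234", NUM_SHARDS) for _ in channels}

# Channel-sharded: the same channels are spread across many shards.
channel_shards = {shard_for(channel_id, NUM_SHARDS) for channel_id in channels}
```

With the workspace key, one large team concentrates all of its traffic on one shard; with the channel key, the same traffic is distributed, at the cost of needing more than one shard to answer team-wide questions.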

Page 39:

What Did We Do

Message Services

Service Decomposition

Page 40:

Challenge: Shared Channels?

Agents of SHIELD

Stark Industries

Message Server

Message Server

Page 41:

Challenge: Shared Channels?

Agents of SHIELD

Stark Industries

Message Server

Message Server

Page 42:

Message Server to Services

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Message Proxy; Message Server; Webapp; Job Queue; MySQL; VtGate; VtTablet; Consul; regions us_east_1 and us_west_1.]

Page 43:

Message Server to Services

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Webapp; Job Queue; VtGate; VtTablet; MySQL; Channel Server; Gateway Server; Presence Server; Message Server; Admin Server; Consul; regions us_east_1 and us_west_1.]

Page 44:

Message Server to Services

Gateway Server: Websocket termination and subscriptions

Admin Server: Cluster management and routing

Presence Server: Store and distribute presence state

Channel Server: Pub/Sub fanout with 5 minute buffering

(Legacy) Message Server: Used for reminders, Google Calendar integration

Page 45:

Message Server to Services

Generic Messaging Services

Everything is a pub/sub “channel”, including message channels as well as workspace / user metadata channels.

● Clients / Flannel subscribe to updates for all relevant objects

● Each Message Service has a clear, dedicated role and set of responsibilities

● Self-healing cluster orchestration to maintain availability

● Each user session now depends on many more servers being available
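The "everything is a pub/sub channel" model, with a short history buffer, can be sketched as follows. This is a single-process toy, not the Channel Server: the class and method names are hypothetical, and replaying buffered messages to a new subscriber is an assumption about how the 5 minute buffer is used.

```python
import time
from collections import defaultdict, deque


class ChannelServer:
    """Toy pub/sub fanout that keeps a short history per channel."""

    def __init__(self, buffer_seconds=300.0, clock=time.monotonic):
        self._buffer_seconds = buffer_seconds     # 5 minutes by default
        self._clock = clock                       # injectable for testing
        self._subscribers = defaultdict(list)     # channel -> callbacks
        self._history = defaultdict(deque)        # channel -> (timestamp, message)

    def _expire(self, channel):
        """Drop buffered messages older than the buffer window."""
        cutoff = self._clock() - self._buffer_seconds
        history = self._history[channel]
        while history and history[0][0] < cutoff:
            history.popleft()

    def subscribe(self, channel, callback):
        """Replay the recent buffer to the new subscriber, then register it."""
        self._expire(channel)
        for _, message in self._history[channel]:
            callback(message)
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Buffer the message and fan it out to every current subscriber."""
        self._history[channel].append((self._clock(), message))
        self._expire(channel)
        for callback in self._subscribers[channel]:
            callback(message)
```

The same `publish` and `subscribe` calls work whether the channel name refers to a message channel or to a metadata stream such as a user's profile updates, which is the point of treating everything as a channel.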

Page 46:

What Did We Do

Message Services

Service Decomposition

Vitess

Fine-Grained DB Sharding

Lazy Client

Flannel Cache

Page 47:

Some Themes...

Page 48:

Topology Management

For each of these projects (and more), architecture evolved from hand-configured server hostnames to a discovery mesh.

● Enables self-registration and automatic cluster repair

● Adds reliance on service discovery infrastructure (Consul)

● Led to changes in service ownership and on-call rotation

Herd of Pets to Service Mesh

Page 49:

Scatter May Be Harmful

Fine-Grained Sharding

Migrating from workspace-scoped to channel-scoped or user-scoped sharding spreads out the load, but some queries must now scatter/gather across shards.

● Removes artificial couplings on back end systems

● Teams are less isolated, so need extra protections from noisy neighbors

● When scattering, clients should tolerate partial results and retry

● Tail latencies can dominate performance when fetching from many shards
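A scatter/gather that tolerates partial results might look like the sketch below. It is an assumption-laden illustration, not Slack's code: `scatter_gather`, the `query` callback, and the per-call timeout are hypothetical, and shards are assumed to be identified by sortable values such as strings.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(shards, query, timeout_seconds):
    """Run query(shard) on every shard in parallel.

    Returns (results, failed): results maps shard -> value for shards that
    answered in time, and failed lists shards that raised or timed out, so
    the caller can use the partial results and retry only the failures.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = {pool.submit(query, shard): shard for shard in shards}
    done, not_done = wait(list(futures), timeout=timeout_seconds)

    results, failed = {}, []
    for future in done:
        shard = futures[future]
        try:
            results[shard] = future.result()
        except Exception:
            failed.append(shard)
    failed.extend(futures[future] for future in not_done)

    pool.shutdown(wait=False)   # do not block the caller on slow shards
    return results, sorted(failed)
```

The timeout bounds the caller's latency by the deadline rather than by the slowest shard, which is one way to keep tail latencies from dominating; the failed list is what a retry would target.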

Page 50:

Deprecation Challenges

As hard as it is to add new services into production under load, it’s proven as hard if not harder to remove old ones.

● With few exceptions, all 2016 services are still in production

● Need to support legacy clients and integrations

● Data migrations need application changes, which take time

Deploying Is Only The Beginning

Page 51:

Performance Short Game

Architectural rework is necessary, but less glamorous performance optimizations pay huge dividends.

● Simple approaches to caching or refactoring

● Client-side jitter to spread out load

● Eliminate unnecessary methods / queries
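Client-side jitter can be as simple as randomizing the delay before a reconnect. The slide does not say which scheme Slack uses; the sketch below swaps in a common one, exponential backoff with full jitter, and the function name, base delay, and cap are assumptions.

```python
import random


def reconnect_delay(attempt, base_seconds=1.0, cap_seconds=60.0, rng=random):
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds.

    Randomizing the whole interval spreads reconnecting clients out in time,
    so they do not all hit the servers at the same instant after an outage.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay(attempt)` before retry number `attempt`, which directly targets the thundering herd problem where load to connect far exceeds load in steady state.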

Grinding It Out

Page 52:

How Slack Works (2016)

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Message Proxy; Message Server (Java); Webapp (PHP); Job Queue; MySQL; regions us_east_1 and us_west_1.]

Page 53:

How Slack Works (2018)

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Webapp; Job Queue; VtGate; VtTablet; MySQL; Channel Server; Gateway Server; Presence Server; Message Server; Admin Server; Consul; regions us_east_1 and us_west_1.]

Page 54:

We’re Not Done Yet

Storage POPs: Geographically distributed back end

Services Services Services: Decompose the monolith and improve service mesh.

Job Queue: Revamp the asynchronous task queue

Resiliency: Degraded functionality when subsystems are unavailable

Eventual Consistency: Change API expectations

Network Scale: Stay ahead of the growth curve

Page 55:

Thank You!


Page 56:

BACKUP

Page 57:

How Slack Works (c 2018)

[Architecture diagram. Labels: Client; Websocket; HTTP API Calls; Flannel Cache; Webapp; Job Queue; VtGate; VtTablet; MySQL; Channel Server; Gateway Server; Presence Server; Message Server; Admin Server; Consul; regions us_east_1 and us_west_1.]

Page 58:

Message Server (Java)

Client Connections: Websocket termination, user / connection state and subscriptions

Webapp Actions: Communication/routing from Webapp → Message Server for channel messages

Presence Indications: User presence state, updates & presence subscriptions - that little green indicator

Subscriptions and Fanout: Last 5 minutes of history, as well as initial subscription and fanout of messages

Scheduled Messages: Used for reminders, Google Calendar integration

Page 59:

Team Sharded MySQL

Team Sharding: Application-defined sharding policy routes all queries to the team shard

Manual Topology Management: Operator-managed host configuration is injected into application code

Active Master / Master: Both sides are writable masters, biases for availability with best-effort consistency

Application Retry Failover: If preferred side is unavailable, connect to the backup side and try again

Split Shards: Manually orchestrated switchover to divide some teams to new host.

[Diagram. Labels: Webapp; MySQL.]
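The application retry failover described on this slide amounts to a try-the-other-side wrapper. The sketch below is a minimal illustration with hypothetical names (`ShardUnavailable`, `query_with_failover`, and the `run_query` callback); it is not the webapp's actual database layer.

```python
class ShardUnavailable(Exception):
    """Raised by run_query when one side of a master/master pair is unreachable."""


def query_with_failover(preferred, backup, run_query):
    """Run the query on the preferred side; on failure, retry once on the backup.

    Both sides are writable, so the retry favors availability. Consistency
    between the two sides is best effort, as the slide notes.
    """
    try:
        return run_query(preferred)
    except ShardUnavailable:
        return run_query(backup)
```

If the backup side is also down, the second `ShardUnavailable` propagates to the caller, which is where a higher-level retry or error response would take over.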

Page 60:

QCon 2016 QCon 2017