Microservices at Netflix Scale

62
Microservices at Netflix Scale First Principles, Tradeoffs, Lessons Learned Ruslan Meshenberg @rusmeshenberg

Transcript of Microservices at Netflix Scale

Page 1: Microservices at Netflix Scale

Microservices at Netflix Scale

First Principles, Tradeoffs, Lessons Learned Ruslan Meshenberg @rusmeshenberg

Page 2: Microservices at Netflix Scale

Microservices: all benefits, no costs?

Page 3: Microservices at Netflix Scale

Netflix is the world’s leading Internet television network with over 81 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films.

Page 4: Microservices at Netflix Scale

Ruslan Meshenberg Director, Platform Engineering

•  Runtime Systems

•  Container Runtime

•  Persistence and Databases

•  Real Time Data Infrastructure

Page 5: Microservices at Netflix Scale

Netflix runs on microservices

Page 6: Microservices at Netflix Scale

Netflix journey to microservices

Page 7: Microservices at Netflix Scale

Our journey took 7 years

https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

Page 8: Microservices at Netflix Scale

Data Center - Monolith

RDBMS

Page 9: Microservices at Netflix Scale

August 2008

Page 10: Microservices at Netflix Scale

First Principles

Page 11: Microservices at Netflix Scale

Buy vs. Build

●  Use or contribute to OSS technologies first

●  Only build what you have to

Page 12: Microservices at Netflix Scale

Services should be stateless*

●  Must not rely on sticky sessions

●  Prove by Chaos testing

*Except the Persistence / Caching layers

Page 13: Microservices at Netflix Scale

Scale out vs. scale up

●  If you keep scaling up, you’ll hit a limit

●  Horizontal scaling gives you a longer runway

Page 14: Microservices at Netflix Scale

Redundancy and Isolation For Resiliency

●  Make more than one of anything

●  Isolate the blast radius for any given failure

Page 15: Microservices at Netflix Scale

Automate destructive testing

●  Simian Army

●  Started with Chaos Monkey

Page 16: Microservices at Netflix Scale

First Principles In Action

Page 17: Microservices at Netflix Scale

Stateless services

Service A

Service B Service B Service B Service B Service B

Page 18: Microservices at Netflix Scale

Verify stateless

Page 19: Microservices at Netflix Scale

Data – from RDBMS to Cassandra

●  NoSQL at scale ●  Open Source ●  Multi-Regional ●  Multi-directional

●  Available ●  Partition Tolerance ●  Tunable Consistency*

Page 20: Microservices at Netflix Scale

Multi-Regional Replication

ZoneA

ZoneB

ZoneC

ZoneB

ZoneC

ZoneA

ZoneA

ZoneB

ZoneC

ZoneC

ClientClient

ZoneA

ZoneB

500ms

Bi-directional Nightly compare & repair

Local Quorum (Typical)

Region A Region B

Page 21: Microservices at Netflix Scale

Last, but not least - Billing

Page 22: Microservices at Netflix Scale

Microservices – Benefits

Page 23: Microservices at Netflix Scale

Our Priorities

1. Innovation

3. Efficiency

2. Reliability

Page 24: Microservices at Netflix Scale

Innovation: tight coupling doesn’t work

Develop •  Team A •  Team B •  Team C •  …

Test Release

Page 25: Microservices at Netflix Scale

Innovation: Loose coupling

Team A Develop,

Test, Deploy, Support

Team B Develop,

Test, Deploy, Support

Team C Develop,

Test, Deploy, Support

Page 26: Microservices at Netflix Scale

Architect

Design

Develop

Review Test

Deploy

Run

Support

End-end ownership

Page 27: Microservices at Netflix Scale

End-end ownership + velocity Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Architect

Design

Develop

Review Test

Deploy

Run

Support

Page 28: Microservices at Netflix Scale

Separation of concerns

UI Feature A Feature B Feature C

Personalization Feature D A/B Test E

Mid-tier A/B Test F Feature H

Infrastructure Availability Scalability Security

Leve

rage

Page 29: Microservices at Netflix Scale

Microservices – Costs

Page 30: Microservices at Netflix Scale

Microservices Is an org change!

Org changes are hard!

Page 31: Microservices at Netflix Scale

Evolving the organization

Page 32: Microservices at Netflix Scale

Central infrastructure investment

Page 33: Microservices at Netflix Scale

Migration doesn’t happen overnight

●  Living in the hybrid world

●  Supporting 2 tech stacks

●  Double the maintenance

●  Multi-master data replication

Page 34: Microservices at Netflix Scale

Microservices - Lessons Learned

Page 35: Microservices at Netflix Scale

IPC is crucial for loose coupling ●  Common language between the services

●  Establishes the contract of interaction

Page 36: Microservices at Netflix Scale

Caching to protect DBs

1.  Read from Cache 2.  On cache miss call service 3.  Service calls DB and responds 4.  Service updates the cache

Client Application

Client Library

EVCache Client Service Client

S S S S . . .

DB DB DB DB . . .

. . .

Request Cache

Page 37: Microservices at Netflix Scale

Operational visibility matters

If you can’t see it, you can’t improve it

Page 38: Microservices at Netflix Scale

Will your Telemetry scale?

Orient

Decide Act

Observe

Page 39: Microservices at Netflix Scale

Edge

ELB

Zuul

Playback

API

Middle Tier & Platform

EVCache

Cassandra

Page 40: Microservices at Netflix Scale

Reliability Matters ●  We strive for 4 9’s of availability

●  That leaves only 52 minutes of downtime per YEAR

●  Netflix outages lead to…

Page 41: Microservices at Netflix Scale

Disappointment

Page 42: Microservices at Netflix Scale

Outrage

Page 43: Microservices at Netflix Scale

Withdrawal

Page 44: Microservices at Netflix Scale

Humor

Page 45: Microservices at Netflix Scale

Cascading failures

99% availability 99% availability 99% availability

99%500

= 0.0657%

Page 46: Microservices at Netflix Scale

FITFault-Injection

Test Framework

Microservice failure

Page 47: Microservices at Netflix Scale

x x

Regional fail-over

Page 48: Microservices at Netflix Scale

Regional fail-over

Page 49: Microservices at Netflix Scale

A word on containers

●  Containers change the level of encapsulation from VM to process

●  Containers can help deliver great developer experience

●  To run containers in production at scale…

Page 50: Microservices at Netflix Scale

Requires something like this:

Titus UI Titus UI

Docker Registry

Docker Registry

Rhea

container container

container

docker

Titus Agent metrics agent

Titus executor

logging agent

zfs

mesos agent

docker

Rhea Titus API

Cassandra

Titus Master

Job Management & Scheduler

S3

Zookeeper

Docker Registry

50

EC2 Autocaling API

Mesos Master

Titus UI

Fenzo

VPC networking driver

container container container

AWS container metadata proxy

Integration

CI/CD Amazon VM’s

Page 51: Microservices at Netflix Scale

Microservices - Resources

Page 52: Microservices at Netflix Scale

http://netflix.github.com

Page 53: Microservices at Netflix Scale

http://netflix.github.com

Page 54: Microservices at Netflix Scale

http://netflix.github.com

Page 55: Microservices at Netflix Scale

http://netflix.github.com

Page 56: Microservices at Netflix Scale

http://netflix.github.com

Page 57: Microservices at Netflix Scale

http://netflix.github.com

Page 58: Microservices at Netflix Scale

Wrap up

Page 59: Microservices at Netflix Scale

Microservices bring great value to development velocity, availability and other dimensions

Page 60: Microservices at Netflix Scale

Microservices at scale require organizational change and centralized infrastructure investment

Page 61: Microservices at Netflix Scale

Be aware of your situation and what works for you

Page 62: Microservices at Netflix Scale

Questions?

Ruslan Meshenberg @rusmeshenberg