The new Netflix API

Post on 12-Apr-2017


Why more complexity must lead to more simplicity

Katharina Probst, DevNexus 2017

Today’s architecture

[Diagram: devices (mostly JS) reach the gateway and the API server across a network boundary. The API server is a single JVM that runs Groovy scripts on top of Java client libraries (Client lib A … Client lib Z), which call the Netflix microservices across another network boundary.]

What is the Netflix API’s raison d’être?

Is the API just one gigantic translation layer?

Is it a routing layer?

If it’s too complex, can we just get rid of it?

Raison d’Être.

1. Orchestration

2. Availability protection

3. Abstraction

Raison d’Être

1. Orchestration

Simple example: search

Related Terms

People

Titles

Search request → response
● Search service provides related search terms
● Search service provides IDs for videos and people
  ○ IDs depend on various factors, e.g., different catalogs in different countries
● For each ID, we need metadata
  ○ Titles
  ○ Images
  ○ Names
  ○ Ratings
  ○ etc.
● ..., which depend on
  ○ Country
  ○ A/B tests user is in
  ○ etc.

Response:
❏ Hydrated videos
❏ People names
❏ Query suggestions

Orchestration
● Own order of operations
● Provide whatever info clients/services need
  ○ From other clients/libraries/services
  ○ From request
● Merge partial results
● Filter results
● Retrieve more info if necessary
● Support mutations (e.g., profile switch)
● Support complex transactions in a limited way
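
The search example can be sketched in a few lines of Java. This is only an illustration of the orchestration steps listed above (call search, hydrate each ID in parallel, merge the partial results); the class, the `SearchHits` and `SearchResponse` types, and the three downstream functions are hypothetical stand-ins, not Netflix's actual code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

class SearchOrchestrationSketch {
    // Hypothetical shapes for illustration; not Netflix's actual data model.
    static final class SearchHits {
        final List<String> videoIds, personIds, relatedTerms;
        SearchHits(List<String> videoIds, List<String> personIds, List<String> relatedTerms) {
            this.videoIds = videoIds;
            this.personIds = personIds;
            this.relatedTerms = relatedTerms;
        }
    }

    static final class SearchResponse {
        final List<String> videos, people, suggestions;
        SearchResponse(List<String> videos, List<String> people, List<String> suggestions) {
            this.videos = videos;
            this.people = people;
            this.suggestions = suggestions;
        }
    }

    // The downstream calls are passed in as functions so the sketch stays self-contained.
    static SearchResponse search(String query, String country,
            BiFunction<String, String, SearchHits> searchService,   // (query, country) -> IDs and terms
            BiFunction<String, String, String> videoMetadata,       // (video ID, country) -> hydrated video
            Function<String, String> personName) {                  // person ID -> display name
        // Step 1: the search service returns IDs and related terms.
        SearchHits hits = searchService.apply(query, country);

        // Step 2: start one metadata lookup per ID; all lookups run concurrently.
        List<CompletableFuture<String>> videos = hits.videoIds.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> videoMetadata.apply(id, country)))
                .collect(Collectors.toList());
        List<CompletableFuture<String>> people = hits.personIds.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> personName.apply(id)))
                .collect(Collectors.toList());

        // Step 3: wait for the partial results and merge them into one response.
        return new SearchResponse(joinAll(videos), joinAll(people), hits.relatedTerms);
    }

    private static List<String> joinAll(List<CompletableFuture<String>> futures) {
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }
}
```

The futures are collected into a list before any of them is joined, so the lookups overlap instead of running one after another.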

2. Availability protection

Prevent this as much as possible

What do customers want?

● No personalized recommendations, or no ability to stream?
● No search, or no ability to continue watching the movie you started last night?
● No cutting-edge A/B experiment experience, or no ability to stream?

Top priority: customer experience

● Top priority of top priority: customer can stream videos
● This means API cannot go down entirely
  ○ If it does, we have an outage
● But some services are not critical to this mission
  ○ A/B - if we don’t know what A/B tests you’re in, you can still get the default experience
  ○ Search - if you can’t search, you can still browse

Exposure to failures

● As your app grows, your set of dependencies is much more likely to get bigger, not smaller

● Overall uptime = (Dep uptime)^(num deps)
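
A quick worked example of the uptime formula, as a sketch. The numbers (30 dependencies at 99.99% uptime each) are assumptions chosen for illustration, and the formula itself assumes that dependencies fail independently and that every one of them is required.

```java
class UptimeSketch {
    /** Overall uptime when all numDeps dependencies must be up, each with the same independent uptime. */
    static double overallUptime(double depUptime, int numDeps) {
        return Math.pow(depUptime, numDeps);
    }

    public static void main(String[] args) {
        // Assumed example values: 30 dependencies, each up 99.99% of the time.
        double overall = overallUptime(0.9999, 30);
        double hoursDownPerMonth = (1 - overall) * 30 * 24; // 30-day month
        System.out.printf("overall uptime: %.4f%n", overall);
        System.out.printf("expected downtime per month: %.1f hours%n", hoursDownPerMonth);
    }
}
```

With these assumed values the overall uptime drops to roughly 99.7%, or about two hours of downtime per month, even though each individual dependency is down for only minutes.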

Hystrix

● Fault-tolerance pattern as a library
● Provides operational insights in real-time
● Automatic load-shedding under pressure
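
To make "load-shedding under pressure" concrete, here is a minimal sketch in plain Java. It is not Hystrix's API or implementation; `LoadShedderSketch` and its `call` method are hypothetical names. The idea shown is a cap on concurrent calls to one dependency, with excess calls answered by a fallback immediately instead of queueing up.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class LoadShedderSketch {
    private final Semaphore permits;

    LoadShedderSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs primary if a permit is free; otherwise sheds the call and returns the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // dependency is saturated: fail fast
        }
        try {
            return primary.get();
        } catch (RuntimeException ex) {
            return fallback.get(); // dependency failed: degrade instead of propagating
        } finally {
            permits.release();
        }
    }
}
```

Because `tryAcquire` never blocks, a slow dependency can tie up at most `maxConcurrentCalls` threads; everything beyond that gets the degraded answer right away.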

Availability protection

[Diagram: devices reach the API through the gateway. Inside the API, scripts call the client libraries (Search client lib, Ratings client lib, Cust client lib, Client lib B … Client lib Z), and each client library crosses a network boundary to its service (Search, Ratings, Customers, ...).]


If you don’t plan for failure

[Diagram: gateway and API, with scripts and client libraries (Search, Ratings, Cust, ...) calling the Search, Ratings, and Customers services across a network boundary.]

If you do plan for failure

[Diagram: gateway and API, with scripts and client libraries (Search, Ratings, Cust, ...) calling the Search, Ratings, and Customers services across a network boundary.]

No search results >> no Netflix

Fallbacks

[Diagram: gateway and API, with scripts and client libraries (Search, Ratings, Cust, ...) calling the Search, Ratings, and Customers services across a network boundary.]

Return static or stale rating

How to handle errors

    return getRatings(id);

How to handle errors

    try {
        return getRatings(id);
    } catch (Exception ex) {
        // static value
        return null;
    }

How to handle errors

    try {
        return getRatings(id);
    } catch (Exception ex) {
        // TODO What to return here?
    }

Handle errors with fallbacks

● Some options for fallbacks

○ Static value

○ Value from in-memory

○ Value from cache

○ Value from network

○ Throw

○ Code

● Make error-handling explicit

● Applications have to work in the presence of either fallbacks or rethrown exceptions
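
One way to combine the fallback options listed above is sketched here for the ratings example: try the service, fall back to the last value seen in memory (stale), and finally to a static value. The class name, the `ratingsService` function, and the default rating of 3.0 are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RatingsWithFallbackSketch {
    // Assumed static fallback value; a real default would be a product decision.
    static final double STATIC_DEFAULT_RATING = 3.0;

    private final Function<String, Double> ratingsService;   // hypothetical downstream call
    private final Map<String, Double> lastKnown = new ConcurrentHashMap<>();

    RatingsWithFallbackSketch(Function<String, Double> ratingsService) {
        this.ratingsService = ratingsService;
    }

    double getRatings(String id) {
        try {
            double fresh = ratingsService.apply(id);
            lastKnown.put(id, fresh);   // remember the value for later fallbacks
            return fresh;
        } catch (RuntimeException ex) {
            // Explicit error handling: stale in-memory value first, static value last.
            Double stale = lastKnown.get(id);
            return stale != null ? stale : STATIC_DEFAULT_RATING;
        }
    }
}
```

The caller always gets a usable rating, which is the point of making the error path explicit rather than returning `null` or leaving a TODO.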

Availability protection beyond Hystrix

● Throttling
● Retries
● Timeouts
● Canaries
● Regional rollouts
● Traffic shifting
● Outlier detection (and elimination)
● Advanced load balancing
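
Two of these items, timeouts and retries, are easy to show together. The following is a minimal sketch using only the JDK; `callWithTimeoutAndRetry` and its parameters are hypothetical names, and a production version would typically add backoff between attempts and a retry budget.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class TimeoutRetrySketch {
    /** Tries call up to maxAttempts times, each bounded by timeoutMillis; returns fallback if all fail. */
    static <T> T callWithTimeoutAndRetry(Supplier<T> call, long timeoutMillis, int maxAttempts, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException ex) {
                // Give up on this attempt. Cancelling completes the future but
                // does not interrupt a task that is already running.
                future.cancel(true);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                break;
            }
        }
        return fallback;
    }
}
```

Retries multiply load on a dependency that may already be struggling, which is one reason they belong next to throttling in the list above.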

3. Abstraction

Abstraction goals

● Shield all device teams from every single mid-tier change … at least for a time. Allows us to move more independently
● Shield all device teams from every single platform/infrastructure change
● Provide APIs not provided by downstream services
  ○ Find all movies that...
  ○ Length of movie
● Implementation flexibility, e.g.,
  ○ Caching
  ○ Batch APIs
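
As a sketch of what this abstraction buys, the facade below offers a higher-level query ("all movies longer than N minutes") on top of a downstream service that only knows how to return movie lengths in batches, and it caches results behind the same interface. All names here (`TitleFacadeSketch`, `batchRuntimeLookup`, `moviesLongerThan`) are hypothetical, and the cache is deliberately naive: unbounded and not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TitleFacadeSketch {
    // Hypothetical downstream batch call: movie IDs -> length in minutes.
    private final Function<List<String>, Map<String, Integer>> batchRuntimeLookup;
    private final Map<String, Integer> cache = new HashMap<>();

    TitleFacadeSketch(Function<List<String>, Map<String, Integer>> batchRuntimeLookup) {
        this.batchRuntimeLookup = batchRuntimeLookup;
    }

    /** Returns lengths for the given IDs, fetching only the cache misses, in one batch call. */
    Map<String, Integer> runtimesFor(List<String> ids) {
        List<String> misses = new ArrayList<>();
        for (String id : ids) {
            if (!cache.containsKey(id)) misses.add(id);
        }
        if (!misses.isEmpty()) {
            cache.putAll(batchRuntimeLookup.apply(misses));
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : ids) {
            Integer minutes = cache.get(id);
            if (minutes != null) result.put(id, minutes);
        }
        return result;
    }

    /** Higher-level API that the downstream service does not offer directly. */
    List<String> moviesLongerThan(List<String> ids, int minutes) {
        List<String> matches = new ArrayList<>();
        runtimesFor(ids).forEach((id, length) -> {
            if (length > minutes) matches.add(id);
        });
        return matches;
    }
}
```

Callers never see whether an answer came from the cache or from a batch call, so either can change without touching device code.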

Abstraction challenges

● Tech debt
● Device teams can have black-box view (“api == cloud”)
● But isn’t the API team the bottleneck?
  ○ Yes, sometimes. But organizational structure makes this less of a problem than m mid-tier teams dealing with n device teams
● But: separation of concerns

Server-side logic

Reminder: Today’s architecture

[Diagram: gateway and API in front of the scripts, the client libraries (Client lib A … Client lib Z), and the Netflix microservices, separated by network boundaries. Label: ~2100 active.]

Device teams write server-side logic

● Decoupling teams → better velocity
● UI teams are empowered to
  ○ Change presentation
  ○ Filter
  ○ Add users to A/B tests, which then leads to e.g., different layout.

What if we didn’t have an API?

[Diagram: What if? Implications for device teams. Gateway, scripts, client libraries (Client lib A … Client lib Z), and Netflix microservices, separated by network boundaries.]

Device teams own client-side applications … and Groovy scripts

What if? Implications for device teams

● Each device team would have to own
  ○ Orchestration
  ○ Frequent dependency updates (currently done, or at least attempted, daily)
  ○ Higher-level APIs (all movies that…)
  ○ Fallbacks and other resiliency protection (e.g., timeouts, retries)
● Recent example
  ○ Library upgrade caused a lot of NPEs -- why?
  ○ Worked with team to find out why
  ○ When fixed, no more NPEs, but instead performance degradation
● Should all teams be dealing with this?

[Diagram: What if? Implications for service teams. Gateway, scripts, client libraries (Client lib A … Client lib Z), and Netflix microservices, separated by network boundaries.]

Service teams own services... and client libraries

What if? Implications for service teams
● Can only make breaking changes if all device teams who use their service upgrade
● Don’t get resiliency protection (e.g., timeouts, load balancing, retries, fallbacks) unless all device teams who use their service provide it
● Should all teams be dealing with this?

What if? Implications for Netflix
● Lower velocity due to tight coupling between many mid-tier teams and many device teams

OR: THE DOWNSIDE OF CENTRALIZATION

Where are we today?

● Principle: don’t repeat logic
  ○ It’s better to do it once in API than do it n times for n devices.

● Principle is good, but leads to complexity

What complexity challenges do we have?

Complexity challenges

● Frequent (not always canaried) updates to a critical service in production
● Difficulty of debugging (esp. for groovy script writers)
● Slow server startup times
● Lack of operational insights into script resource consumption
● Difficulty of performance profiling
● Lack of feedback loop
● Decoupled code versioning and transitive dependencies

Where are we going next?

Top priorities

● Move groovy scripts out
● Split up API

New architecture: Edge PaaS

[Diagram labels: Gateway, EAS, Node apps on NodeQuark (running on Titus), client libraries (Client lib A … Client lib Z), Netflix microservices, with network boundaries between the tiers.]

Edge Auth Service
● Auth termination
● Centralized place for auth

Edge PaaS:
● Platform for node scripts
● Developer tooling for entire SDLC
● Remote API with optimized data access (Falcor)


Two APIs

Two (or more) APIs

[Diagram labels: Gateway, EAS, Node apps on NodeQuark (running on Titus); DNAClient A … Z with DNA Service A … Z; PB Client B … Z with PB Service A … Z; Shared Client A, C, ... with Shared Service A … Z; network boundaries between the tiers.]

Split API by function

NodeQuark Platform

NodeQuark Platform

[Diagram labels: Zuul, EAS, Node apps on NodeQuark (running on Titus), Java client libraries (Client lib A … Client lib Z), Netflix microservices, with network boundaries between the tiers.]

Platform for node scripts

Edge PaaS: Node Platform

● Node apps run in containers on Titus platform
● Node Platform provides
  ○ Integration into Netflix ecosystem (e.g., discovery)
  ○ Logging
  ○ Dashboards, metrics out of the box with option to customize
  ○ Support for mocking and testing
● Titus provides
  ○ Scheduling
  ○ Autoscaling

Developer experience

New architecture: Edge PaaS

[Diagram labels: Gateway, EAS, Node apps on NodeQuark (running on Titus), Java client libraries (Client lib A … Client lib Z), Netflix microservices, with network boundaries between the tiers.]

Developer tooling for entire SDLC

Edge PaaS: Developer tooling

● Command line tool for node apps
  ○ Setup
  ○ Starting apps
  ○ Deploying apps
● Local development and debugging of node apps
● UI for lifecycle management, e.g., version management
● One-click rollouts, one-click rollbacks
● Versioning

Remote API

New architecture: Edge PaaS

[Diagram labels: Zuul, EAS, Node apps on NodeQuark (running on Titus), client libraries (Client lib A … Client lib Z), Netflix microservices, with network boundaries between the tiers. Callout: Remote API with optimized data access.]

Edge PaaS: Remote API

● API still takes care of
  ○ Orchestration
  ○ Resiliency protection
  ○ Abstraction
● Optimized access with Falcor
  ○ “RESTful composition” with caching
● Binary transport
● Future: channel support

Greater simplicity

Isolated failures: Scripts don’t affect each other (usually)
[Diagram labels: API; Temporarily unavailable!]

Independent root causing
[Diagram labels: API; Latency spike after push: 150ms; Average latency: 10ms]

Independent autoscaling
[Diagram label: API]

Independent insights
[Diagram labels: API; Average latency: 50ms; Average latency: 10ms]

Better regression/performance testing
[Diagram label: API]
Tests not affected by other scripts eating up resources on the same JVM

Conclusion

Complexity and simplicity

● Product has become much more complex
  ○ Scripts (more scripts, more complex scripts)
  ○ Features
  ○ Number of downstream services to integrate
  ○ More personalization
  ○ etc.
● Complexity of API service is high → Need to optimize for simplicity now
  ○ Process isolation
  ○ Cleaner developer experience

END