Performance and Fault Tolerance for the Netflix API - July 18 2012
The Netflix API Platform for Server-Side Scripting
Problem identified: new servers aren’t coming up healthy!
Ugh! There’s a problem. Errors from API are up.
Stream starts per second drifting further and further off.
[Chart: stream starts per second, expected value vs. actual value]
Finally root-caused! Now restarting all unhealthy servers.
Back to normal!
Stream starts per second also back to normal.
[Chart: stream starts per second, expected value vs. actual value]
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) | Netflix microservices]
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) with ~700 active Groovy scripts | Netflix microservices]
[...]
Device1VideoCommon.formatKidsSeason(apiRequest, [...], imageUrl)
[...]
[...]
Device2Common.formatAllSeasons([...])
[...]
[...]
dataPublishingService.getShowFeedbackBuilder(user, video)
[...]
[Diagram labels: i, i+1, i+2, i+3, i+4; j, j+1; k, k+1; l; n, n+1, n+2, n+3]
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) with Groovy scripts | Netflix microservices]
Strong resiliency with Hystrix.
What about resiliency on this side?
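The resiliency that Hystrix provides rests on the circuit-breaker pattern: after repeated failures, calls to a dependency are short-circuited to a fallback until a cool-down period passes. The sketch below illustrates that idea only. It is not the Hystrix API (Hystrix is a Java library), the `CircuitBreaker` class and its option names are hypothetical, and it wraps synchronous functions to keep the example short.

```javascript
// Hypothetical, simplified circuit breaker. This is NOT the Hystrix API.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.resetTimeoutMs = resetTimeoutMs;     // how long the circuit stays open
    this.now = now;                           // injectable clock (assumption, for testing)
    this.failures = 0;                        // consecutive failures seen so far
    this.openedAt = null;                     // time the circuit opened, or null if closed
  }

  // True while calls should be short-circuited to the fallback.
  isOpen() {
    if (this.openedAt === null) return false;
    // Once the reset timeout has elapsed, allow a trial call through.
    return this.now() - this.openedAt < this.resetTimeoutMs;
  }

  // Run `command`; on failure or open circuit, return `fallback(error)` instead.
  run(command, fallback) {
    if (this.isOpen()) return fallback(new Error('circuit open'));
    try {
      const result = command();
      this.failures = 0;     // a success closes the circuit again
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      return fallback(err);
    }
  }
}
```

A caller would wrap each dependency call, for example `breaker.run(() => fetchRatings(user), () => [])`, where `fetchRatings` stands in for any call that may fail.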
Periodic cleanup
New upload increases memory usage.
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) with Groovy scripts | Netflix microservices]
Few, small scripts; fewer uploads.
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) with Groovy scripts/apps | Netflix microservices]
~700 more complex scripts/apps, 10-50 uploads per day.
Lack of process isolation is a growing risk.
[Diagram: clients A, B, C ... Y, Z (mostly JS) | node scripts (node.js, process isolation) | network boundary | API Server JVM (Java) | Netflix microservices]
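To make the process-isolation point concrete: when each script runs in its own process, a crash or runaway loop in one script cannot take down its host or its neighbours. The sketch below is a minimal illustration using Node's built-in `child_process` module. The `runIsolated` helper and its `timeoutMs` option are hypothetical names, and this is not a description of how the Netflix platform launches scripts.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical helper: run a snippet of script source in its own Node.js
// process, so a crash or hang stays contained in the child.
function runIsolated(source, { timeoutMs = 2000 } = {}) {
  const child = spawnSync(process.execPath, ['-e', source], {
    encoding: 'utf8',   // return stdout/stderr as strings
    timeout: timeoutMs, // the child is killed if it runs longer than this
  });
  return {
    ok: child.status === 0,  // true only for a clean exit
    exitCode: child.status,  // null when the child was killed by a signal
    signal: child.signal,
    stdout: child.stdout,
    stderr: child.stderr,
  };
}
```

Calling `runIsolated('throw new Error("boom")')` returns a result with `ok` set to `false` while the calling process keeps running, which is the property the slide is after.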
API
Temporarily unavailable!
API
[Diagram: local development with Docker Machine. Labels: local project, local container, live reload file watcher, docker build / run, file watcher agent, proxy, NetworkAgent, node-inspector, debugger]
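The local-development slide lists a live reload file watcher alongside docker build / run. One way such a watcher could decide when to rebuild is to compare snapshots of the project directory. The sketch below assumes polling by modification time; `snapshot` and `changedFiles` are hypothetical helper names and do not come from the talk's tooling.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Record the modification time of every file under `dir`, recursively.
function snapshot(dir) {
  const result = new Map();
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      for (const [file, mtime] of snapshot(full)) result.set(file, mtime);
    } else {
      result.set(full, fs.statSync(full).mtimeMs);
    }
  }
  return result;
}

// List files that were added, modified, or removed between two snapshots.
function changedFiles(before, after) {
  const changed = [];
  for (const [file, mtime] of after) {
    if (before.get(file) !== mtime) changed.push(file); // new or modified
  }
  for (const file of before.keys()) {
    if (!after.has(file)) changed.push(file);            // removed
  }
  return changed.sort();
}
```

A polling loop would take a snapshot on each tick and trigger a rebuild whenever `changedFiles` returns a non-empty list.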
[Diagram: clients A, B, C ... Y, Z (mostly JS) | network boundary | API Server JVM (Java) with Groovy scripts | Netflix microservices]
Problems are hard to root-cause; performance is hard to measure and optimize.
API
device server-side script
device client
API
device server-side script
ATLAS
Versioning
Easy access to instances
Rollback
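Versioning and rollback can be pictured as a history of activated script versions in which the most recent entry is live. The sketch below is a toy in-memory model under that assumption. The `ScriptDeployments` class is hypothetical and does not reflect the platform's actual deployment mechanism.

```javascript
// Hypothetical in-memory model of versioned script deployments.
class ScriptDeployments {
  constructor() {
    this.history = []; // versions in activation order; the last one is live
  }

  // Make `version` the live version.
  activate(version) {
    this.history.push(version);
    return version;
  }

  // The currently live version, or null if nothing has been activated.
  get live() {
    return this.history.length ? this.history[this.history.length - 1] : null;
  }

  // Drop the live version and return to the one activated before it.
  rollback() {
    if (this.history.length < 2) {
      throw new Error('no earlier version to roll back to');
    }
    this.history.pop();
    return this.live;
  }
}
```

For example, after activating `'v1'` and then `'v2'`, a call to `rollback()` makes `'v1'` live again.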
[Diagram, repeated across three slides: clients A, B, C, E | node script (node.js) | network boundary | API Server JVM | Netflix microservices]
Memory leak makes RSL blow up. Clearer idea of where the problem is.
Same with node script.
[Diagram: clients A, B, C ... Y, Z (mostly JS) | node scripts (node.js) | network boundary | API Server JVM | Netflix microservices]