Netflix and Containers: Not A Stranger Thing
![Page 1: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/1.jpg)
Andrew Spyker (@aspyker) - Engineering Manager
![Page 2: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/2.jpg)
About Netflix
● 86.7M members
● A few thousand employees
● 190+ countries
● > ⅓ of NA internet download traffic
● 500+ microservices
● Many 10's of thousands of VM's
● 3 regions across the world
![Page 3: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/3.jpg)
Netflix has an elastic, cloud native, immutable microservice architecture using full devops built on VM’s!
Why are we messing around with containers?
![Page 4: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/4.jpg)
Technical motivating factors for containers
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
![Page 5: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/5.jpg)
Sampling of realized container benefits
Media Encoding - encoding research development time
● VM’s platform to container platform - 1 month vs. 1 week
Continuous Integration Testing
● Build all Netflix codebases in hours
● Saves development 100’s of hours of debugging
Edge Re-architecture using NodeJS
● Focus returns to app development
● Simplifies, speeds test and deployment
![Page 6: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/6.jpg)
Batch applications
![Page 7: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/7.jpg)
Multi-tenant (cgroups/Mesos) historically used for batch
Linux cgroups
![Page 8: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/8.jpg)
What do batch users want?
● Simple shared resources, run till done, job files
● NOT
○ EC2 Instance sizes, autoscaling, AMI OS’s
● WHY
○ Offloads resource management ops, Simpler
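A "job file" in this sense could look roughly like the following. This is a minimal sketch, not the Titus job format: the `BatchJob` class, its field names, and the image name are all hypothetical. The point is that the user states what to run and how much it needs, and never names an instance type or AMI.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BatchJob:
    """Hypothetical job description: what to run and the resources it needs."""
    image: str                      # container image name (illustrative)
    command: List[str]              # entry point; the job runs until it exits
    cpus: float = 1.0
    memory_mb: int = 512
    gpus: int = 0
    retries: int = 0
    env: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        # Reject obviously malformed requests before they reach a scheduler.
        if not self.image or not self.command:
            raise ValueError("image and command are required")
        if self.cpus <= 0 or self.memory_mb <= 0 or self.gpus < 0:
            raise ValueError("resource requests must be positive")


# Note what is absent: no EC2 instance size, no autoscaling policy, no AMI.
job = BatchJob(
    image="encoder:latest",
    command=["encode", "--input", "title.mov"],
    cpus=4,
    memory_mb=8192,
    retries=2,
)
job.validate()
```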
![Page 9: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/9.jpg)
Introducing Titus
Batch
Job Management
Resource Management & Optimization
Container Execution
Integration
Workflow, Data Analysis, Adhoc Upstream Systems
![Page 10: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/10.jpg)
Netflix Batch Job Examples
● Algorithm Model Training (with GPU’s)
![Page 11: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/11.jpg)
Netflix Batch Job Examples
● Media Encoding
● Digital Watermarking
![Page 12: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/12.jpg)
Netflix Batch Job Examples
● Open Connect CDN Reporting
● Ad hoc Reporting
![Page 13: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/13.jpg)
Lessons Learned from Batch
● Docker helped generalize use cases
● Scheduling required (GPU, elastic)
● Initially ignored failures (with retries)
● Time sensitive batch came later
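"Ignoring failures with retries" can be pictured as a loop that reruns a task until it exits cleanly. The sketch below is illustrative only: `run_with_retries`, its parameters, and the fake task are assumptions, not Titus code.

```python
import time
from typing import Callable


def run_with_retries(task: Callable[[], int],
                     max_attempts: int = 3,
                     backoff_s: float = 0.0) -> int:
    """Run `task` until it returns exit code 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if task() == 0:
            return attempt
        if attempt < max_attempts:
            # Linear backoff between attempts (hypothetical policy).
            time.sleep(backoff_s * attempt)
    raise RuntimeError(f"task failed after {max_attempts} attempts")


# A fake task that fails twice and then succeeds.
exit_codes = iter([1, 1, 0])
attempts = run_with_retries(lambda: next(exit_codes))
```

This works for run-till-done jobs because a failed attempt costs only time, which is why time sensitive batch needs more care.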
![Page 14: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/14.jpg)
Current Container Usage - Batch
● 1000’s of container hosts (g2, m4, r3 instances)
● 1000’s containers / hour average
● Large spikes of CI testing and Digital Watermarking
From day of 10/26
● 47K containers
● Bursts of 1000 containers in a minute
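To put the 10/26 numbers in proportion, the short calculation below compares the daily average launch rate with the quoted burst. It uses only the two figures from the slide; the variable names are illustrative.

```python
containers_per_day = 47_000   # from the slide: 47K containers in one day
burst_per_minute = 1_000      # from the slide: bursts of 1000 in a minute

avg_per_hour = containers_per_day / 24
avg_per_minute = containers_per_day / (24 * 60)
burst_ratio = burst_per_minute / avg_per_minute

print(f"{avg_per_hour:.0f} per hour, {avg_per_minute:.1f} per minute, "
      f"burst is {burst_ratio:.1f}x average")
# prints "1958 per hour, 32.6 per minute, burst is 30.6x average"
```

A burst of roughly 30 times the average rate is what makes an elastic resource pool necessary.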
![Page 15: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/15.jpg)
Service applications
![Page 16: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/16.jpg)
Why Services in containers?
Theory Reality
Developer
![Page 17: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/17.jpg)
Opportunities to evolve our baking
● Java focused supported AMI, baking works well!
● However, wanted to allow
○ other stacks to evolve independent of OS updates
○ simplified builds (vs. Java and OS based tooling)
○ reliable smaller instances for dynamic languages
○ ability to develop locally with same image
![Page 18: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/18.jpg)
Services are just long running batch?
Services
Job Management
Resource Management & Optimization
Container Execution
Integration
Service Apps
Batch
![Page 19: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/19.jpg)
Nope, not that easy - Titus Details
[Architecture diagram. Components labeled: Titus UI, Titus API, Rhea, Docker Registry, CI/CD, Titus Master (Job Management & Scheduler, Fenzo), Cassandra, Zookeeper, S3, EC2 Autoscaling API, Mesos Master. On the AWS VM's (Titus Agent): Mesos agent, Titus executor, docker, metrics agent, logging agent, zfs, Pod & VPC network drivers, AWS metadata proxy, and the containers. Integration is shown as a separate layer.]
![Page 20: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/20.jpg)
Services more complex
● Services resize constantly and run forever
○ Autoscaling
○ Hard to upgrade underlying hosts
● Require IPC integration
○ Routable IPs, service discovery
○ Ready for traffic vs. just started/stopped
● Existing well defined dev, deploy, runtime & ops tools
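The "ready for traffic vs. just started/stopped" distinction can be sketched as a small state machine. The states and class names here are hypothetical and are not the Titus or Eureka data model. The idea is that a running process is not routable until a health check has passed.

```python
from enum import Enum
from typing import List


class State(Enum):
    STARTING = "starting"
    STARTED = "started"    # process is up, but may not be able to serve yet
    READY = "ready"        # health check passed, may receive traffic
    STOPPED = "stopped"


class ServiceTask:
    def __init__(self, task_id: str, ip: str) -> None:
        self.task_id = task_id
        self.ip = ip
        self.state = State.STARTING

    def on_process_started(self) -> None:
        self.state = State.STARTED

    def on_health_check(self, healthy: bool) -> None:
        # Health results only matter once the process is running.
        if self.state in (State.STARTED, State.READY):
            self.state = State.READY if healthy else State.STARTED


def routable(tasks: List[ServiceTask]) -> List[str]:
    """Only READY tasks should be advertised to service discovery."""
    return [t.ip for t in tasks if t.state is State.READY]


a, b = ServiceTask("task-a", "10.0.0.1"), ServiceTask("task-b", "10.0.0.2")
a.on_process_started()
a.on_health_check(True)
b.on_process_started()          # started, but no passing health check yet
ready_ips = routable([a, b])
```

A batch scheduler only needs started and stopped; the extra READY state is one reason services are more than long running batch.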
![Page 21: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/21.jpg)
Real networking is hard
![Page 22: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/22.jpg)
Multi-tenant
Need IP per container - in VPC
Using security groups
Using IAM roles
Considering network resource isolation
![Page 23: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/23.jpg)
Enabling VPC Networking
[Diagram: Titus EC2 Host VM with three interfaces: eth1 / ENI1 (SecGrp=A), eth2 / ENI2 (SecGrp=X), eth3 / ENI3 (SecGrp=Y,Z). Tasks 0 to 3 are labeled: Task 0 with "No IP, SecGrp A"; the others with SecGrp X (twice) and SecGrp Y,Z, using IP 1, IP 2, and IP 3. Pods contain a pod root and app(s), each attached through veth<id>. Other labels: Linux Policy Based Routing + Traffic Control; Titus EC2 Metadata Proxy; 169.254.169.254 IPTables NAT; 169.254.169.254 non-routable IP.]
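One way to read the diagram is that each ENI carries a single set of security groups, and tasks that need the same set share that ENI while each getting its own IP. The sketch below models only that placement decision. The `Eni` and `Host` classes and the per-ENI IP capacity are assumptions for illustration, not the Titus network driver.

```python
from typing import FrozenSet, List, Optional


class Eni:
    def __init__(self, name: str, ip_capacity: int) -> None:
        self.name = name
        self.ip_capacity = ip_capacity              # assumed per-ENI IP limit
        self.sec_groups: Optional[FrozenSet[str]] = None
        self.tasks: List[str] = []


class Host:
    def __init__(self, enis: List[Eni]) -> None:
        self.enis = enis

    def place(self, task: str, sec_groups: FrozenSet[str]) -> Optional[str]:
        # Prefer an ENI already configured with these groups and a free IP.
        for eni in self.enis:
            if eni.sec_groups == sec_groups and len(eni.tasks) < eni.ip_capacity:
                eni.tasks.append(task)
                return eni.name
        # Otherwise claim an ENI that has not been configured yet.
        for eni in self.enis:
            if eni.sec_groups is None:
                eni.sec_groups = sec_groups
                eni.tasks.append(task)
                return eni.name
        return None  # this host cannot take the task


host = Host([Eni("eth2", ip_capacity=2), Eni("eth3", ip_capacity=2)])
placements = [
    host.place("task1", frozenset({"X"})),
    host.place("task2", frozenset({"X"})),
    host.place("task3", frozenset({"Y", "Z"})),
    host.place("task4", frozenset({"W"})),   # no ENI left for a new group set
]
```

Because a task may be unplaceable on a host for security group reasons alone, the scheduler has to know about security groups.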
![Page 24: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/24.jpg)
Solutions
● VPC Networking driver
○ Supports ENI’s - full IP functionality
○ Scheduled security groups
○ Support traffic control (resource isolation)
● EC2 Metadata proxy
○ Adds container “node” identity
○ Delivers IAM roles
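The metadata proxy idea can be sketched as a request handler that answers on behalf of the container rather than the host. The path `/latest/meta-data/iam/security-credentials/` is the standard EC2 instance metadata path for role credentials. Everything else here (the `MetadataProxy` class, the lookup by source IP, the role names, the placeholder response body) is a hypothetical illustration, not the Titus implementation.

```python
from typing import Dict, Tuple

ROLE_PATH = "/latest/meta-data/iam/security-credentials/"


class MetadataProxy:
    """Answers metadata requests with the calling container's identity."""

    def __init__(self, roles_by_container_ip: Dict[str, str]) -> None:
        self.roles = dict(roles_by_container_ip)

    def handle(self, source_ip: str, path: str) -> Tuple[int, str]:
        role = self.roles.get(source_ip)
        if role is None:
            return 403, "unknown container"
        if path == ROLE_PATH:
            # Listing roles: expose only the container's own role.
            return 200, role
        if path.startswith(ROLE_PATH):
            requested = path[len(ROLE_PATH):]
            if requested != role:
                return 403, "role not assigned to this container"
            # A real proxy would fetch temporary credentials for the role.
            return 200, f"<temporary credentials for {role}>"
        return 404, "not handled in this sketch"


proxy = MetadataProxy({"10.0.0.5": "encoderRole"})
listing = proxy.handle("10.0.0.5", ROLE_PATH)
denied = proxy.handle("10.0.0.5", ROLE_PATH + "hostRole")
```

Keeping the host role out of reach of containers is the "protect the reduced host IAM role" requirement from the security slide.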
![Page 25: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/25.jpg)
Reuse existing infrastructure services
[Diagram: Netflix Cloud Infrastructure (VM’s + Containers). Labels: VPC, EC2, AWS AutoScaler, VMs, App, Cloud Platform (metrics, IPC, health), Atlas, Eureka, Edda.]
![Page 26: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/26.jpg)
Enable them for containers
[Diagram: Netflix Cloud Infrastructure (VM’s + Containers). VM side: VPC, EC2, AWS AutoScaler, VMs, App, Cloud Platform (metrics, IPC, health). Container side: Titus Job Control, Containers, App, Cloud Platform (metrics, IPC, health), Batch Containers. Shared services: Atlas, Eureka, Edda.]
![Page 27: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/27.jpg)
Spinnaker
![Page 28: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/28.jpg)
Deploy based on new image tags
![Page 29: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/29.jpg)
Basic resource requirements
IAM Roles & Sec Groups per container
Deploy Strategies
Same as VM’s
![Page 30: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/30.jpg)
Easily see health & discovery
![Page 31: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/31.jpg)
![Page 32: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/32.jpg)
![Page 33: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/33.jpg)
Secure Multi-tenancy
Common to VM’s and tiered security needed
● Protect the reduced host IAM role, allow containers to have specific IAM roles
● Needed to support same security groups in container networking as VM’s
User namespacing
● Docker 1.10 - Introduced User Namespaces
● Didn’t work w/ shared networking NS
● Docker 1.11 - Fixed shared networking NS’s
● But, namespacing is per daemon, not per container, as hoped
● Waiting on Linux
● Considering mass chmod / ZFS clones
![Page 34: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/34.jpg)
Titus Advanced Scheduling
● Support for AZ balancing
● Multiple instance types selected based on workload
● Elastic underlying common resource pool
○ Bin packing managed transparently across all apps
● Hard and soft constraints
● Resource affinity and task affinity
● Capacity guarantees (critical tier)
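How a hard constraint, a soft constraint, and bin packing might combine in one placement decision is sketched below. This is an illustration under stated assumptions, not the Titus scheduler: the `Agent` class, the zone names, and the ordering of the tie-breakers are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    name: str
    zone: str
    free_cpus: float


def pick_agent(agents: List[Agent], cpus: float,
               job_zone_counts: Counter) -> Optional[Agent]:
    # Hard constraint: the task must fit on the agent.
    candidates = [a for a in agents if a.free_cpus >= cpus]
    if not candidates:
        return None
    # Soft constraint: prefer the zone running the fewest tasks of this job.
    # Tie-breaker: bin packing, i.e. leave the least CPU unused.
    return min(candidates,
               key=lambda a: (job_zone_counts[a.zone], a.free_cpus - cpus, a.name))


def schedule(agents: List[Agent], task_cpus: List[float]) -> List[Optional[str]]:
    counts: Counter = Counter()
    placement: List[Optional[str]] = []
    for cpus in task_cpus:
        agent = pick_agent(agents, cpus, counts)
        if agent is None:
            placement.append(None)
            continue
        agent.free_cpus -= cpus
        counts[agent.zone] += 1
        placement.append(agent.name)
    return placement


agents = [Agent("a1", "zone-a", 4), Agent("a2", "zone-a", 8), Agent("b1", "zone-b", 8)]
placement = schedule(agents, [2, 2, 2, 2])
# placement is ["a1", "b1", "a1", "b1"]: tasks alternate zones, and a1 is
# filled before a2 is touched.
```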
![Page 35: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/35.jpg)
Fenzo - Keep resource scheduling extensible
Fenzo - Extensible Scheduling Library
Features:
● Heterogeneous resources & tasks
● Autoscaling of Mesos cluster
○ Multiple instance types
● Plugin-based scheduling objectives
○ Bin packing, etc.
● Plugin-based constraints evaluator
○ Resource affinity, task locality, etc.
● Scheduling actions visibility
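The plugin idea can be illustrated with plain functions: constraints filter hosts, and a fitness function scores the survivors. Fenzo itself is a Java library, and this Python sketch does not reproduce its API. All names here (`choose_host`, `cpu_bin_packing`, the host dictionaries) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Host = Dict[str, float]   # e.g. {"total_cpus": 8, "used_cpus": 2, "gpus": 1}
Task = Dict[str, float]   # e.g. {"cpus": 2, "gpus": 1}
Constraint = Callable[[Task, Host], bool]
Fitness = Callable[[Task, Host], float]


def cpu_bin_packing(task: Task, host: Host) -> float:
    """Score in [0, 1]: higher when the host would end up fuller."""
    return (host["used_cpus"] + task["cpus"]) / host["total_cpus"]


def fits_cpu(task: Task, host: Host) -> bool:
    return host["used_cpus"] + task["cpus"] <= host["total_cpus"]


def has_gpu_if_needed(task: Task, host: Host) -> bool:
    return task.get("gpus", 0) <= host.get("gpus", 0)


def choose_host(task: Task, hosts: Dict[str, Host],
                constraints: List[Constraint], fitness: Fitness) -> Optional[str]:
    best_name, best_score = None, -1.0
    for name, host in hosts.items():
        # Every plugged-in constraint must pass before scoring.
        if all(check(task, host) for check in constraints):
            score = fitness(task, host)
            if score > best_score:
                best_name, best_score = name, score
    return best_name


hosts = {
    "m4-1": {"total_cpus": 8, "used_cpus": 6},
    "m4-2": {"total_cpus": 8, "used_cpus": 2},
    "g2-1": {"total_cpus": 8, "used_cpus": 0, "gpus": 1},
}
plugins = [fits_cpu, has_gpu_if_needed]
cpu_choice = choose_host({"cpus": 2}, hosts, plugins, cpu_bin_packing)
gpu_choice = choose_host({"cpus": 2, "gpus": 1}, hosts, plugins, cpu_bin_packing)
```

Swapping `cpu_bin_packing` for a different scoring function changes the objective without touching the selection loop, which is the extensibility the slide describes.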
![Page 36: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/36.jpg)
Current Container Usage - Service
● Still small ~ 2000 long running containers
● NodeJS Device UI Scripts Apps
● Stream Processing Jobs - Flink
● Various Internal Dashboards
![Page 37: Netflix and Containers: Not A Stranger Thing](https://reader031.fdocuments.net/reader031/viewer/2022030317/5870d90a1a28ab64768b7395/html5/thumbnails/37.jpg)
Questions?
Andrew Spyker (@aspyker) - Engineering Manager